Life of PII

A coworker wrote to me yesterday, wondering about the process for enabling a credit freeze with Equifax. He pointed out that those requesting a credit freeze must enter their date of birth and social security number online, and Equifax has demonstrated that it cannot keep this information secure!

If the notion of sending your personally identifying information (PII) to a place like Equifax makes you uneasy, you’re right to be nervous. Unfortunately, the problem is more serious than a handful of vendors who verifiably cannot protect our identifying information.

There are several misunderstandings about PII that, until they are addressed, ensure that each of us is at risk of identity theft, or worse.

The first misunderstanding is that PII can definitely be protected.  Organizations often brag about their data security, and on more than one occasion some of those same organizations have sent me, via unsecured email, data sets simply oozing with personal information. Presumably, they felt I could be trusted to protect what I received.   But as for everyone in between who might have read those emails, who can say?

It’s a start, as security experts recommend, to store PII only in an encrypted format that has no outside use. That (almost) assures that no one can simply read our social security numbers from a database once they break in. But as my coworker pointed out, there is still risk: for an external person to confirm their identity, they must submit PII, and for at least some period of time that PII is unencrypted, and vulnerable.
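As a minimal sketch of the "encrypted format with no outside use" idea, one common approach is to store only a keyed hash of the identifier, so the database never holds the raw value and the stored digest is useless elsewhere. The key name and functions below are hypothetical illustrations, not a production design; a real system would also have to defend against brute-forcing the small SSN space if the key leaked.

```python
import hashlib
import hmac

# Assumed server-side secret; in practice this would come from a
# key-management service, never from source code.
SECRET_KEY = b"server-side-secret"

def protect_ssn(ssn: str) -> str:
    """Return a keyed digest usable only for equality checks, not recovery."""
    return hmac.new(SECRET_KEY, ssn.encode(), hashlib.sha256).hexdigest()

def verify_ssn(submitted: str, stored_digest: str) -> bool:
    """Compare a freshly submitted SSN against the stored digest,
    using a constant-time comparison."""
    return hmac.compare_digest(protect_ssn(submitted), stored_digest)

stored = protect_ssn("078-05-1120")
print(verify_ssn("078-05-1120", stored))  # True
print(verify_ssn("000-00-0000", stored))  # False
```

Note that even here the submitted SSN exists in memory, unencrypted, for the duration of the check — exactly the window of vulnerability described above.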

As for other data security: data is only as secure as the least secure access to it, and we should admit that this means not very secure. One slip – like the well-intentioned people who sent me PII in an email – and the game is up.

I’ve learned from security pros, and my own experience: the First Law of Security is that nothing is truly secure.  When thinking about security, we should never start a sentence with “An attacker could never…,”  because they almost certainly will, if it’s worth their trouble.


The next misunderstanding is that PII is well-defined – that if we encrypt unique identifiers (like social security numbers) so no outsider can use them, we’ll be in pretty good shape.

Regrettably, this is not the case. Personal attributes like age, gender, zip code, and income bracket, taken in combination, may serve almost as well as a social security number. We cannot usually encrypt these attributes, as they’re useful for presentation and analysis. If a bad actor finds information suggesting that we’re worth the trouble of an attack, a combination of human-readable personal attributes may very well be “good enough” – for being in a very small group is little different from being uniquely identified. Consider what is now available in online public records, and remember the First Law…

If in a data set of personal information, there is any combination of unencrypted attributes that generates a small group of records, that’s effectively PII – it means individuals are at risk, for we cannot know what information outside of our system might be used to resolve our identity completely.
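The test described above can be sketched mechanically: count how many records share each combination of unencrypted attributes, and flag any combination that isolates a group smaller than some threshold k (the idea behind k-anonymity). The data and function below are illustrative assumptions, not part of any particular system.

```python
from collections import Counter

def risky_groups(records, quasi_identifiers, k=5):
    """Return attribute combinations shared by fewer than k records --
    effectively PII, since so small a group nearly identifies individuals."""
    counts = Counter(
        tuple(r[attr] for attr in quasi_identifiers) for r in records
    )
    return {combo: n for combo, n in counts.items() if n < k}

people = [
    {"age": 34, "zip": "02139", "income": "50-75k"},
    {"age": 34, "zip": "02139", "income": "50-75k"},
    {"age": 71, "zip": "02139", "income": "75-100k"},  # a group of one
]
print(risky_groups(people, ["age", "zip", "income"], k=2))
# {(71, '02139', '75-100k'): 1}
```

A check like this can run as a gate before any data set is delivered: if it returns anything, the unencrypted attributes are, in combination, effectively PII.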


Perhaps the most crucial misunderstanding about PII is our presumption that PII is useful for anything other than confirming identity – i.e. authentication.  When it comes to analytics and business intelligence, PII should really stand for “Probably Is Irrelevant.”

What analyst needs to know someone’s date of birth? There are many things that correlate with age, but few that correlate with whether we’re a Capricorn or a Sagittarius. And social security numbers? Anything this attribute can tell us, we can learn in better ways.

Authentication using PII is a process that can be made nearly secure, using encrypted information wherever possible. But once human-readable attributes – either singly or in combination – can come close to identifying us, we should know there is a security problem waiting to happen. Translation: liability! No organization has yet, to my knowledge, been forced into bankruptcy by liability from a PII breach, but that time may not be far off. If we take the First Law seriously, this corollary also applies: for planning purposes, all potential data breaches should be regarded as actual breaches.

In information delivery and modeling, we are usually oblivious to the potential cost and risk associated with our input data, but these are considerations that should become part of our world view. The inadvertent delivery of “effective” PII can be trapped. In modeling, we frequently use data like dates of birth that are far more precise than necessary to build a suitable model. Even if the model results do not expose PII, the presence of input data sets holding actual or effective PII presents a risk to our organization. If data is present, someone will find it, and use it – potentially in a fashion that we won’t like.
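To make the date-of-birth example concrete: most models that benefit from age do just as well with a coarse age band, which carries far less identifying power than a full birth date. A minimal sketch of that coarsening, with an assumed ten-year band width:

```python
from datetime import date

def age_bracket(dob: date, as_of: date, width: int = 10) -> str:
    """Replace a precise date of birth with a coarse age band,
    trading identifying precision for modeling sufficiency."""
    # Subtract one if the birthday hasn't occurred yet this year.
    age = as_of.year - dob.year - ((as_of.month, as_of.day) < (dob.month, dob.day))
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

print(age_bracket(date(1985, 6, 15), date(2017, 10, 1)))  # "30-39"
```

Applying a transformation like this at ingestion means the precise birth date never enters the modeling data set at all.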

The best way to protect PII is not to use it at all – to deliver models and visualizations that never use actual or effective PII in the first place.

Does this complicate modeling and information delivery? Sure. For quantitative models, it means an optimal model must meet its predictive requirements while limiting the precision of the data it employs. At a minimum this means an embedded optimization, and you know what’s coming next… sometimes we won’t be able to meet both the requirement and the constraints. C’est la guerre.

About five years ago I predicted that within five years many of us analysts would find ourselves engaged in security-related work.   That prediction has not yet come to pass, but I still think it will, and before too long.  The best protection against breaches of valuable data is not to encrypt it, or to protect it, or to otherwise make it difficult for a hacker to get to it – history tells us that those methods ultimately fail.   These methods are valuable, and they do slow attackers, but ultimately they require a strategy of perfect defense, against attackers whose weapons are always improving.  At some point,  the defense will be scored upon.

The best way to protect against a data breach is to limit the use of data we don’t need, or even to insist that some sensitive data are off limits. Analytics practice, which has had rather little to say about limiting data usage in the past, should have a great deal to say about limiting data usage in the future.

2 thoughts on “Life of PII”

  1. I think this piece makes a number of important points. I would add that the same desire to make the system highly efficient also leads to the system’s high vulnerability. If getting a new credit card required going into the local bank and showing a photo ID, there would be far less credit card fraud. But such a system would also be far less efficient than today’s application process. So as the author points out, we need to develop a system that lessens vulnerability and, as far as possible, doesn’t compromise on efficiency. But I agree with the author that the current system tilts too much toward efficiency without really thinking through the consequences for vulnerability. Further to the author’s point, one’s PII is surely known by far more people than just that individual, and with a bit of work can probably be learned by any half-decent hacker. At which point, anyone’s PII becomes far from personal except in terms of liability (e.g., damage to one’s credit rating). That is, anyone can now use these numbers to open a credit card account, so PII has become information that is required to open an account but that could be associated with anyone.


  2. I completely agree that a different system/methodology is needed. There seems to be an issue of gatekeepers and the number of gates when it comes to protecting one’s information. In addition, as the author points out, there is an issue of how to get through the gate (i.e., authentication). On this, it seems that we need to return to a system that makes use of an authentication source similar to how a photo ID was used in the pre-internet world. Yes, there is still the issue that one can counterfeit photo IDs, but doing so seems much harder. In today’s world, where there is far less face-to-face interaction, what about relying on voice recognition for authentication? The voice recording could also embody a user-selected passcode phrase.

    As the author states, much of what firms use for PII doesn’t actually identify a person. A SSN is merely a number assigned to each individual and has little to do with who the person is (yes, there is some information related to location of birth and DOB), but there is a fair bit of randomness to the number. Compared to hair and eye color, height, fingerprint, voice, etc., a SSN has little identifying information. So if we want to actually identify a person, shouldn’t we use information that actually identifies?

