We don’t like ambiguity much. We see weather reports, financial reports, economic forecasts – to name a few – all offering precision far beyond what is supported by the data. My impression is that we encourage this artificial fidelity through our expectation that when we ask a question, we should arrive at a single answer.
Unfortunately, a single answer is often unavailable. In analytics we have a two-fold problem regarding precision. First, we can only know the answer to any question with limited precision. Second, the questions we naturally ask are often imprecise – a feature we don’t always account for in our analytics efforts.
Here is a specific example of a “simple” question that isn’t as precise as it might appear. I take it as axiomatic that we should represent our human diversity in advisory boards, cabinets and so on. However, I also wondered:
How many people does it take to represent human diversity? (Just for fun, go ahead and take a guess before you read down. I took a wild stab before I looked at this. “I don’t know – maybe a couple of dozen people?”)
“How many people…” is a normal question, but I can’t get a number from it until diversity and representation have definitions. Using US Census and Pew Research Center work, I enumerated diversity with these categories: two physical genders, four sexualities, six major race/ethnic groups, and five religions. As for “representation,” there isn’t really one definition. Do I want every combination of gender and sexuality and race/ethnicity and religion? Or could we just require that each category value (but not each combination) be covered? That’s not for me to say, is it? So this isn’t one question at all – it’s at least two, bracketed by these questions:
(1) How many people are needed to create a diverse group (with the categories defined above), if every combination of category values appears?
(2) How many people are needed to create a diverse group (as defined above), if every value in each category appears at least once (but not necessarily each combination)?
There is a female Christian in the answer to the first question. In the answer to the second, there is a female, and a Christian – but maybe not both together in the same person.
These two questions have definite answers. For the first question, to get all the combinations we’ll need 240 people – just multiply together the number of values in each category (2 * 4 * 6 * 5). For the second question, the answer is 6 people (one for each race/ethnicity – the other independent categories have fewer values and can be covered once race/ethnicity is filled in).
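As a quick check on the arithmetic, here is a minimal Python sketch. The dictionary and variable names are my own labels for illustration; only the category counts come from the text above.

```python
from math import prod

# Number of values in each category, as enumerated in this post.
category_sizes = {
    "gender": 2,
    "sexuality": 4,
    "race_ethnicity": 6,
    "religion": 5,
}

# Question (1): every combination appears, so multiply the counts.
every_combination = prod(category_sizes.values())

# Question (2): every value appears at least once, so the largest
# category sets the minimum (assuming the categories are independent).
every_value = max(category_sizes.values())

print(every_combination, every_value)  # 240 6
```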
6 versus 240. That’s a big difference, resulting from the imprecision in our original question, which yielded two precise questions with definite answers. In fact, we should add more precise questions, to allow for the legitimate exclusion of very unusual demographic combinations. (By the way, exclusions don’t necessarily change the answer to the second question.)
I know… Ask a question and not only do we get a huge number of new questions, there is a range of 234 people in the possible answers.
This might look pretty bleak if precision is the goal, but we have in fact accomplished something. We’ve rendered our original question into a list of precise questions with definite answers – i.e. starting with (2) above, and adding any required exclusions.
If discourse allows us to agree on 1) our definition of diversity; and 2) a modest set of exclusions, we could arrive at a final question with an answer equal to, or very near, six. But analytics on its own cannot resolve the final question for us – that requires dialogue and agreement. If we do not narrow the field of possible questions, the answer to our original question remains as it is: between 6 and 240, inclusive. In practical problems it’s easy for us to fly by this part, to get to a definite answer.
You might ask: was there something “wrong” with our original question, with its large range of answers? No. It’s just a question – as it turns out, a question without a definite answer. Analytics doesn’t take a position on precision – we’re the ones doing that. If we desire precision and can offer a protocol that might allow us to narrow the field of possible questions and answers, we’ve made progress.
If you’re now thinking that the challenges for practical analytics studies are to ensure that we render a natural question accurately, and that we resolve precision appropriately – I agree, and I’ll return to those challenges in upcoming posts.
Details on the diversity example.
Any definition of diversity will have its limitations, supporters, and detractors. I started with the terminology and categories from the US Census (www.census.gov) and Pew Research Center (www.pewforum.org), with modifications as follows:
- I enumerated sexuality by gay/straight/bisexual/transgender rather than straight/gay/lesbian/bisexual/transgender, to ensure gender and sexuality are independent categories.
- For religion, I tracked Christian, Unaffiliated, Judaic, Muslim, and Hindu.
- Unlike the US Census, I used a combined race/ethnicity category, with Hispanic as one of the options.
The result is:
Gender (2): Male, Female
Sexuality (4): Bisexual, Transgender, Gay, Straight
Race/Ethnicity (6): White Americans, Black Americans, Asian Americans, Native Americans and Alaska Natives, Native Hawaiians and Pacific Islanders, Hispanic.
Religion (5): Christian, unaffiliated, Judaic, Muslim, Hindu
When looking at all possible combinations (first question) we have two genders, then four possible sexuality categories, then six possible race/ethnicity groups, and finally five possible religions. That’s 2 * 4 * 6 * 5 = 240 combinations.
When looking at the second question (ensure each category is covered) I did not invoke an (integral) optimization protocol for this starter study. We just need an example, so it’s easier to make a grid and just fill it in. Start with the category having the most values. Then – as long as the categories are independent – the other categories can also be covered. Here is an example:
To show that the second question is under-constrained, I excluded the African-American (“Black American”)/Judaic combination in making up this example; it is rare enough that it would likely be excluded in practice.
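For readers who would rather generate such a grid than fill it in by hand, here is a minimal Python sketch. It is not the procedure I used for this post, and the function names and the rotation search are assumptions made for illustration. It starts from the largest category, cycles the other categories alongside it, and shifts the columns until no row matches an excluded combination.

```python
from itertools import product

# Category values as listed in this post.
CATEGORIES = {
    "Gender": ["Male", "Female"],
    "Sexuality": ["Bisexual", "Transgender", "Gay", "Straight"],
    "Race/Ethnicity": [
        "White Americans", "Black Americans", "Asian Americans",
        "Native Americans and Alaska Natives",
        "Native Hawaiians and Pacific Islanders", "Hispanic",
    ],
    "Religion": ["Christian", "unaffiliated", "Judaic", "Muslim", "Hindu"],
}

# Each exclusion is a partial description of a person; anyone who
# matches every entry of a rule is not allowed in the group.
EXCLUSIONS = [{"Race/Ethnicity": "Black Americans", "Religion": "Judaic"}]


def is_excluded(person, exclusions):
    """Return True if the person matches any exclusion rule in full."""
    return any(
        all(person[name] == value for name, value in rule.items())
        for rule in exclusions
    )


def covering_group(categories, exclusions=()):
    """Find a group, sized by the largest category, covering every value.

    Each column cycles through its category's values; rotating a column
    still covers every value because the group is at least as long as
    the column. Returns None if no rotation avoids the exclusions.
    """
    names = list(categories)
    size = max(len(values) for values in categories.values())
    for offsets in product(range(size), repeat=len(names)):
        group = [
            {
                name: categories[name][(i + offset) % len(categories[name])]
                for name, offset in zip(names, offsets)
            }
            for i in range(size)
        ]
        if not any(is_excluded(person, exclusions) for person in group):
            return group
    return None


group = covering_group(CATEGORIES, EXCLUSIONS)
for person in group:
    print(" | ".join(person.values()))
```

The search only tries rotations of each column, so it is not exhaustive: a `None` result would mean this simple search found nothing, not that no six-person group exists. With the single exclusion above it returns six people, consistent with the point that exclusions don’t necessarily change the answer to the second question.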