We’re Scientists

Being young means never having to say that moving is a problem: you simply take anything you own of value, dump it in a van, and set off to your next living quarters.

So when my brother graduated from college and needed to move to graduate school, we rented a van, dumped his minimal belongings (including a handcrafted telescope) therein, and hit the road early one summer morning.

After about 10 hours of driving through the flatlands of lower Wisconsin, Illinois, Indiana, and Ohio, we decided that it was time for dinner, and dinner time had brought us to I-70 in the vicinity of Dayton, Ohio, and Wright-Patterson Air Force Base. We stopped, walked into a nearly empty diner, and were seated by our talkative waitress. I imagine she was just looking for a little light chat to alleviate a dull work shift, and it wasn’t her fault that after 10 hours of driving, we were two obviously tired people who could not hold up our end of any conversation, small talk or otherwise.

Still, she gave it her best shot, with this opening:  “Are you boys from the base?”

Tired or not, my brother was not a person to allow a genuine Dan Aykroyd moment to pass. He looked her in the eye.

“No, ma’am. We’re scientists.”

From that moment, she focused our conversation on what was required to see us served, which served us right, I suppose.

………….

Also sometimes puzzling is the designation data science, which, as data scientist friends of mine all point out, is a very loosely defined term. That’s OK by me – I kind of like that it’s loosely defined. After all, trying to separate science from engineering is a little like trying to separate art from craft, with the only probable result being that those finding themselves on the engineering or craft side of the definition will become aggrieved. But this does raise the question of whether analytics should be considered a “science,” like physics or chemistry.

Conventional science, at least, can be defined as the intellectual and practical activity encompassing the systematic study of the structure and behavior of the physical and natural world through observation and experiment.

By those lights, a lot of analytics is science, as long as you are willing to mentally stretch what counts as the “physical and natural world.”

Conventional scientists tend to divide into experimentalists and theoreticians. The former create controlled conditions to give well-defined and unbiased data answering clearly defined questions with well-understood assumptions. The latter create conceptual systems explaining the data that experimentalists provide. Some scientists take both roles, but the skills are different, and most specialize either in generating well-understood data or in understanding those results in a conceptual framework. Both are essential, but rather like the economic argument that labor comes before capital, data come before theory: with no good data there is nothing to explain. Careful experimentalists provide data that are good in actuality as well as in appearance.

When I see people talk about data science, or present their skills at meetings and conferences, the majority focus on the data analogue to conventional theory: models – explanatory,  predictive, or optimization in their varied forms. It’s often engaging work, but just as in conventional science, a model is only as good as the underlying data, and only as meaningful as the set of assumptions and conditions under which the data have been generated.   And just as in conventional science, good data come before good theory.

Ironically, with modern tools it is often not difficult to craft respectable models from a self-consistent data set. What is often more difficult is the “experimental” aspect of data science: understanding whether our underlying data are truly accurate or precise, whether they are validated against external reality, what questions they answer, and what assumptions went into their generation. Our data systems store numbers with ease – there are scores, costs, and counts galore. However, those same systems store the context for those numbers with much less ease, so we don’t always know what our numbers represent. Challenging whether the numbers we have actually give us the answers we want might reasonably be called “experimental” data science – and as with conventional experiments, data come before theory.

Exploratory analysis certainly has a role in experimental data science, but much of data-experimental work is the old-fashioned grind of iterative data validation, requirements gathering, uncertainty analysis, and predictor development. Is that less exciting than predictive modeling? Probably. Is it more critical than modeling? Yes. Without data whose meaning is understood by all stakeholders, there is little point in a model.
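A minimal sketch of what one pass of that validation grind might look like, in Python. The record fields (`record_id`, `score`, `cost`) and the 0-100 score range are assumptions invented purely for illustration; real checks would come from the requirements gathered for the actual data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    # Hypothetical fields, invented for illustration only.
    record_id: str
    score: Optional[float]  # assumed to lie on a 0-100 scale
    cost: Optional[float]   # assumed to be non-negative


def validate(records):
    """Return (record_id, problem) pairs; an empty list means no check failed."""
    problems = []
    seen = set()
    for r in records:
        # Duplicate identifiers usually mean the same thing was counted twice.
        if r.record_id in seen:
            problems.append((r.record_id, "duplicate id"))
        seen.add(r.record_id)

        if r.score is None:
            problems.append((r.record_id, "missing score"))
        elif not 0 <= r.score <= 100:
            problems.append((r.record_id, "score outside 0-100"))

        if r.cost is None:
            problems.append((r.record_id, "missing cost"))
        elif r.cost < 0:
            problems.append((r.record_id, "negative cost"))
    return problems


sample = [
    Record("a1", 87.0, 12.50),
    Record("a2", 140.0, 9.00),
    Record("a2", 55.0, None),
    Record("a3", None, -3.00),
]

for record_id, problem in validate(sample):
    print(record_id, problem)
```

Checks like these only catch numbers that are impossible. They say nothing about whether a plausible-looking number means what we think it means, which still takes a conversation with whoever generated it.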

We hear a great deal about the shortage of data scientists, and I wouldn’t argue the point. But the frequent origin of poor data-driven decisions – and they’re out there – is a poor understanding of the original numbers we have, followed by taking those numbers to mean something they actually do not. I don’t know any experienced data scientist who doesn’t have a “misunderstood data yielding a bad decision” story to tell. More than we need better models, we need better understanding of the data going into those models. For models can be made from irrelevant data as easily as from appropriate data – only uncertain, inconsistent, or poorly represented data really thwart model-making, and that’s different from “irrelevant.” What we really need are more data experimentalists – it’s the best kind of data science there is.
