One day I was waiting in the Miami airport for my flight home, trying not to listen to a personal conversation going on nearby. I wasn’t entirely successful. A young couple was discussing their future, and it was not going well. While he wanted to continue their relationship, she was ready to wrap things up. Finally, he asked her if she thought there was any chance for their future relationship, and her response was memorable. “The future,” she told him, “holds many unknowns.”
Indeed it does. And the future has been in the news again, perhaps triggered by a confluence of raging populists, Alvin Toffler’s death, the increasing availability of data, and of course the presumptive takeover of planet Earth by cognitive robots. That’s a lot of uncertainty. And with uncertainty, our interest in future predictions naturally increases. Bring on the futurists.
Futurist writing seems to cover a lot of ground, ranging from inspired guesswork to something very much like good science fiction – “what-if fiction,” to use Ursula K. Le Guin’s phrase – informed speculation premised on a potential aspect of our future world.
It is tempting, and attempted, to assert that our ability to predict the future is only limited by computing power and our ability to process data. Load in all the data, leverage our increasing big-data capabilities, and the predictions shall inevitably flow.
As tempting as that may be, it is also largely a misconception. I want to explain why, because understanding this leads to better and more reasonable expectations for our data-related outcomes. When we want to predict the future – whether we’re thinking about the chance of rain next week, the growth of a business division, the role of artificial intelligence in the next decade, or whether our daughters will go to Harvard – we must, explicitly or implicitly, do three things.

First, we pick a system of reference: precipitation, artificial intelligence, the business division, our daughters.

Second, we construct a model for how the outside world will interact with our system. There has to be a model, because we can’t know or track everything that might impact our system. We might default to assuming that the outside world doesn’t matter, but that is rarely legitimate. Practitioners of statistical mechanics often model the impact of outside forces using stochastic methods, and that’s a good approach when the time scale of interest is much longer than the characteristic time of those outside interactions. (For what I still regard as one of the best introductions to stochastic interactions, see S. Chandrasekhar, “Stochastic Problems in Physics and Astronomy,” Reviews of Modern Physics 15, 1 (1943).)

Finally, we put our knowledge of the system together with our outside-world model and attempt to craft a future trend. Our basis might be empirical, or enabled by a physically based equation of motion.
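The stochastic approach to outside interactions can be made concrete with a minimal sketch – my own toy illustration with made-up parameters, not anything from Chandrasekhar’s analysis. Treat each unknowable outside interaction as a random shock, and watch how uncertainty about the system’s state grows with time:

```python
import random
import statistics

def spread_after(steps, shock=1.0, trials=2000, seed=0):
    """Simulate a system whose only dynamics are random outside shocks
    (a 1-D random walk) and return the spread of its final positions."""
    rng = random.Random(seed)
    finals = []
    for _ in range(trials):
        x = 0.0
        for _ in range(steps):
            x += rng.gauss(0.0, shock)  # one unknowable outside interaction
        finals.append(x)
    return statistics.stdev(finals)

# Uncertainty grows like the square root of elapsed time:
# ~10x as many steps gives roughly ~3x the spread.
print(spread_after(10), spread_after(100))
```

That square-root growth is the signature of stochastic outside forcing, and a useful sanity check on how quickly a prediction degrades as the forecast horizon lengthens.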
Simply collecting more data about our system has a limited impact on future predictions, when our system interacts with unknown and unknowable outside forces. Deterministic (non-random) mechanics – required for any definite prediction – are unavailing unless we can define an entire system that has little relevant interaction with the outside world.
Here’s an example. Take a soccer ball – that’s going to be our system. It weighs about 450 grams, and assuming it is made largely from biostuff or carbon-based materials, it has an average molecular weight of about 14 g/mole. That translates to roughly 2 x 10^25 atoms. Let’s say we decided to store minimal information about every soccer-ball atom in our data system – at 50 bytes per atom, that would be around one trillion petabytes of data. A petabyte (10^15 bytes) is still a lot these days, meaning: we have a ton of data. Nonetheless, let’s say we’ve stored it, somehow – now we have everything we might want to know about our ball.
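The arithmetic is easy to check. Here’s a quick back-of-envelope script using the same assumed numbers (450 g, 14 g/mole average molecular weight, 50 bytes per atom):

```python
# Back-of-envelope check of the soccer-ball numbers in the text.
AVOGADRO = 6.022e23          # atoms per mole
mass_g = 450                 # assumed ball mass
mol_weight = 14              # assumed average g/mole for carbon-based stuff
bytes_per_atom = 50          # assumed minimal per-atom record

atoms = mass_g / mol_weight * AVOGADRO
petabytes = atoms * bytes_per_atom / 1e15

print(f"{atoms:.1e} atoms, {petabytes:.1e} petabytes")
# roughly 1.9e25 atoms and about 9.7e11 (~a trillion) petabytes
```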
Now, let’s ask this: where is the ball during the course of a soccer game? All that data about our system is pretty much irrelevant, because a bunch of soccer players are going to be kicking our system every which way for 90 minutes. The outside-world interactions with our system dominate this problem, and all the system data we’ve stored doesn’t matter at all.
What about storing everything in the outside world too? Good luck – we’re talking about capturing the full physiology, including some 100 billion neurons, of every player at one instant in time, plus the ability to convert all of that into their trajectories over 90 minutes. Even if the data could be stored, the problem is very likely ill-conditioned: a tiny variance in the initial data produces a large variance in long-term outcomes.
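To see what “ill-conditioned” means in practice, here’s a toy illustration – my own, using the standard logistic map rather than anything soccer-specific. Two copies of the same fully deterministic system, started one part in ten billion apart, stop resembling each other within a few dozen steps:

```python
def steps_to_diverge(x0, eps=1e-10, r=4.0, threshold=0.1, max_steps=200):
    """Iterate two copies of the chaotic logistic map x -> r*x*(1-x),
    starting eps apart, and return the step at which their states
    first disagree by more than `threshold`."""
    a, b = x0, x0 + eps
    for n in range(1, max_steps + 1):
        a = r * a * (1 - a)
        b = r * b * (1 - b)
        if abs(a - b) > threshold:
            return n
    return None

# An initial error of one part in ten billion becomes a macroscopic
# disagreement after only a few dozen steps.
print(steps_to_diverge(0.3))
```

No amount of extra storage fixes this: unless the initial data are exact (and they never are), the long-term trajectory is effectively unpredictable.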
In short: we’re not going to answer this question. We’re also not going to figure out if our newborns are going to Harvard, for similar reasons. Or (at least in these parts) whether it is going to rain at 3 pm next Friday.
Does that mean we can’t predict future events? No – we have a limited capability, but we do want to understand those limits. In the very short term our system may have limited outside interactions, and we can do a pretty good job of predicting its future state. Thinking about our ball again: right after it gets kicked, we know about where it will go. Longer term, if our system has strong and unknowable outside interactions, our conclusions will be more limited, and we’ll typically want to recast our question to accommodate that. For example, while we don’t know exactly where the soccer ball is going to be during the game, we can say about where it is with near certainty – it’s on the playing field. And we can say where it is after the game – in the field manager’s locker. That may not seem like much, but if our question was “Where on Earth (literally) is my soccer ball?” we’ve actually done pretty well. It’s a big planet, after all. And the amount of data needed to arrive at this answer? Very little – no computers required at all. Similarly, we might not be able to predict the exact time and place of a rainstorm next week, but we can do a reasonable job of assessing whether rain is likely (and roughly how likely) in the next week or two.
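That short-term confidence can be sketched with idealized projectile motion – a deliberate simplification (no drag, no spin, and illustrative launch numbers of my own choosing) of what happens in the second or two after a kick, before any player intervenes:

```python
import math

G = 9.81  # gravitational acceleration, m/s^2

def landing_distance(speed, angle_deg):
    """Downfield range of an ideal drag-free projectile launched
    from ground level at `speed` m/s and `angle_deg` degrees."""
    theta = math.radians(angle_deg)
    return speed ** 2 * math.sin(2 * theta) / G

# Assumed kick: 25 m/s at 30 degrees.
print(f"{landing_distance(25, 30):.1f} m downfield")  # prints "55.2 m downfield"
```

While the outside world (the other 21 players) is out of the picture, a two-line deterministic model does fine; the moment it re-enters, we fall back to coarser claims like “somewhere on the field.”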
Our system definition, our model for its outside interactions, the granularity of our question, and our knowledge of how the composite system-plus-world evolves in time together determine the quality of a future prediction. When we don’t have a good feel for any one of these elements, a future prediction is dicey. The role of artificial intelligence – “AI” – in our future world qualifies as dicey right now: we have a much better understanding of how AI systems are likely to evolve (although even that isn’t certain) than we do of how AI systems will interact with our world. So for now we have to bracket the possibilities and consider how we might handle them.
What is the role of big data and analytics in making better future predictions? Well – ahem – that’s a little hard to predict, but I can say what I’d like to see. This is not about more data alone – simply loading in more data about our system is no guarantee of a better prediction. A critical challenge for scientists and data teams is identifying the right system in the first place: one small enough to understand, but either decoupled from outside influences or subject to a simple and manageable model of them. Convenient processing of very large data sets should improve our ability to empirically define those ideal systems – casting a wide enough net to capture the outside interactions that matter, and setting aside excess data that is irrelevant to our current line of inquiry.