Correlation and Causation in Big Data

Big data began as a term for extremely large data sets. These data sets cannot be managed or analyzed with conventional database programs, for three reasons. First, their size exceeds the capacity of standard data management tools. Second, the data is varied and often unstructured, because it comes from different sources such as the sales department, the customer contact center, social media, mobile devices and so on. Third, the data moves at high velocity: imagine what it takes for a GPS to continually recalculate the next move to stay on the best route and avoid traffic jams, looking at all the traffic information coming from official sources as well as from other drivers in real time, and transmitting all the details before the car reaches a crossroads.

The term ‘Big Data’ is also used for the new technology needed to process the data and reveal patterns, trends, and associations. Furthermore, the term is now synonymous with big data’s analytical power and its business potential to help companies and organizations improve operations and make faster, more intelligent decisions.

What is big data used for?

The first and most evident use is statistics: how many chocolates have we sold? What are the sales worldwide, split by country? Where do the customers come from?
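As a minimal sketch of these descriptive questions, the Python snippet below totals sales, splits them by country and counts customers per country. The records, the field layout and the function names are all made up for illustration; a real big data system would run the same kind of aggregation over far more data and on distributed tooling.

```python
from collections import defaultdict

# Hypothetical sales records: (country, customer_id, boxes_sold).
# These values are invented purely for illustration.
sales = [
    ("Belgium", "c1", 120),
    ("Switzerland", "c2", 95),
    ("Belgium", "c3", 60),
    ("Germany", "c4", 80),
    ("Switzerland", "c5", 45),
]


def total_sold(records):
    """How many chocolates have we sold in total?"""
    return sum(qty for _, _, qty in records)


def sold_per_country(records):
    """Total sales split by country."""
    totals = defaultdict(int)
    for country, _, qty in records:
        totals[country] += qty
    return dict(totals)


def customers_per_country(records):
    """Where do the customers come from? Counts distinct customers."""
    customers = defaultdict(set)
    for country, customer_id, _ in records:
        customers[country].add(customer_id)
    return {country: len(ids) for country, ids in customers.items()}


print(total_sold(sales))
print(sold_per_country(sales))
print(customers_per_country(sales))
```

Each function answers one of the three questions above. None of them says anything about why the numbers look the way they do.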

Then correlation comes into play: things that show the same tendency, that go together or move together. For example, countries with strong chocolate sales also have a lot of PhDs.

Thanks to http://tylervigen.com

Correlation is not causality. Eating chocolate does not make you a PhD (nor the other way around: having a PhD doesn’t make you more likely to love chocolate). Analyzing correlations is still a big deal. A correlation can be a mere conjunction, as with thunder and lightning. It can also reflect a causal relation, but even when there is causality it is hard to tell the direction of the relationship: which is the cause and which is the effect. Nevertheless, predictive behaviour analysis on big data does a great job even when the ‘why’ behind it, the underlying causes, remains hidden and unexplained.
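To make the point concrete, here is a small simulation in Python. It is only a sketch built on assumptions: a hypothetical hidden factor drives two synthetic series (labelled chocolate and PhDs purely for illustration), and neither series influences the other. The series still correlate strongly, and the correlation largely disappears once the hidden factor is subtracted out. The numbers are generated, not real-world data.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 1000

# Hypothetical hidden factor (an assumption of this sketch, not a real finding).
hidden = [rng.gauss(0, 1) for _ in range(n)]

# Both series depend on the hidden factor plus independent noise.
# Neither series is computed from the other, so there is no causal link between them.
chocolate = [h + rng.gauss(0, 0.5) for h in hidden]
phds = [h + rng.gauss(0, 0.5) for h in hidden]

r_raw = pearson(chocolate, phds)

# Remove the hidden factor's contribution: only the independent noise is left.
chocolate_rest = [c - h for c, h in zip(chocolate, hidden)]
phds_rest = [p - h for p, h in zip(phds, hidden)]
r_rest = pearson(chocolate_rest, phds_rest)

print(f"correlation of the raw series:          {r_raw:.2f}")
print(f"correlation without the hidden factor:  {r_rest:.2f}")
```

In real data the hidden factor is usually unknown, which is exactly why a strong correlation on its own does not tell us what causes what.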

The great potential of big data is that it helps us discover correlations, patterns and trends where we couldn’t see them before, but it’s up to us to create the theories and models that explain the relations behind those correlations.