Correlation and Causation in Big Data

Big data began as a term for extremely large data sets. These data sets cannot be managed or analysed with conventional database programs, not only because their size exceeds the capacity of standard data management tools, but also because of the variety and unstructured nature of the data (it comes from different sources such as the sales department, the customer contact centre, social media, mobile devices and so on) and because of the velocity at which it moves (imagine what it entails for a GPS to continually recalculate the next move to stay on the best route and avoid traffic jams: processing all traffic information coming from official sources as well as from other drivers in real time, and transmitting the details before the car reaches a crossroads).

The term ‘Big Data’ is also used to describe the new technology needed to process the data and reveal patterns, trends, and associations.  Furthermore, the term has become synonymous with big data’s analytical power and its business potential, helping companies and organisations improve operations and make faster, more intelligent decisions.

What is big data used for?

The first and most evident use is statistics: how many chocolates have we sold? What are the global sales around the world, split per country? Where do our customers come from?

Then correlation comes into play: things that have the same tendency, that go together or move together. For example, countries with strong chocolate sales also have a lot of PhDs.
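As a minimal sketch of these two steps, first aggregate statistics and then a correlation, here is how it could look in Python. All the numbers, column names and countries below are invented for illustration; they are not real sales or education figures:

```python
import pandas as pd

# Hypothetical per-country figures (illustrative only, not real data)
data = pd.DataFrame({
    "country": ["BE", "CH", "DE", "FR", "US", "JP"],
    "chocolate_sales_kg_per_capita": [6.8, 8.8, 7.9, 4.2, 4.4, 1.2],
    "phds_per_100k_inhabitants": [48, 80, 63, 40, 55, 30],
})

# Descriptive statistics: the classic "how much did we sell, and where?"
print(data.set_index("country")["chocolate_sales_kg_per_capita"].sort_values(ascending=False))

# Correlation between the two variables (it says nothing about causation)
r = data["chocolate_sales_kg_per_capita"].corr(data["phds_per_100k_inhabitants"])
print(f"Pearson correlation: {r:.2f}")
```

A high coefficient here only tells us the two columns move together; it says nothing about why.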

Correlation is not causality. Eating chocolate doesn’t turn you into a PhD (nor the other way around: having a PhD doesn’t make you more likely to love chocolate).  Still, analysing correlations is a big deal.  A correlation can be a simple conjunction, as with thunder and lightning. It can reflect a causal relation, and even when there is causality, it is hard to say in which direction it runs, which is the cause and which the effect.  Nevertheless, big data predictive behaviour analysis is doing a great job, even when the ‘why’ behind it, the underlying cause, remains hidden and unexplained.

The great potential of big data is that it helps us discover correlations, patterns and trends where we couldn’t see them before, but it’s up to us to create the theories and models that explain the relations behind those correlations.

Alex Pentland’s article on Data-Driven Society

I recently got the new issue of Scientific American (October 2013), and the front page announced the article ‘The Data-Driven Society’ by Alex Pentland.  I just had to read it 🙂

He co-leads the World Economic Forum’s Big Data and Personal Data initiatives.  He writes about all the digital bread crumbs we leave behind in our daily life (like GPS and GSM info, or electronic payments) and what can be done with them.

With his students at the MIT Human Dynamics Laboratory, he is discovering, through data analytics, mathematical patterns that can predict human behaviour. Bread crumbs record our behaviour as it really happens, he says; they are more accurate than the information from social media, where we choose what we want to disclose about ourselves.  Pentland and his team are particularly interested in the patterns of idea flows.

Among the most surprising findings that my students and I have discovered is that patterns of idea flow (measured by purchasing behavior, physical mobility or communications) are directly related to productivity growth and creative output.

Analysing those flows, he uncovered two factors that characterise a healthy pattern of idea flow:

  • engagement: connecting to others, usually within the same team or organisation, and
  • exploration: going outside the group to exchange ideas.

Both are needed for creativity and innovation to flourish.  To find those factors, he based his research on graphs of different types of interactions: person-to-person meetings, emails, SMS, and so on.
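As a purely hypothetical sketch of how such factors could be approximated from an interaction log, here is a toy example; the log, the names and the simple counting scheme are my own assumptions for illustration, not Pentland’s actual sociometric measures:

```python
from collections import Counter

# Hypothetical interaction log: (person, contact, same_team) records.
interactions = [
    ("ana", "bob", True), ("ana", "carl", True), ("ana", "dina", False),
    ("bob", "ana", True), ("bob", "eve", False), ("carl", "ana", True),
]

engagement = Counter()   # contacts inside the person's own team
exploration = Counter()  # contacts outside the team

for person, _contact, same_team in interactions:
    if same_team:
        engagement[person] += 1
    else:
        exploration[person] += 1

for person in sorted({p for p, _, _ in interactions}):
    print(person, "engagement:", engagement[person], "exploration:", exploration[person])
```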

We may not have the tools he used (like electronic badges for tracking person-to-person interactions), but intuitively this is something we already know: good communication is essential for the success of a team, and talking to an external person may provide a new insight.  It’s always good to be proved right, isn’t it?

Check my next post, where I’ll continue with his article: there are a lot of great concepts he presents, such as the ‘New Deal on Data’ for personal data protection.


Snowden showed us the dangers of Big Data with PRISM. Are we up to the challenge of steering its use?

A television screen shows former U.S. spy agency contractor Edward Snowden during a news bulletin at a cafe at Moscow’s Sheremetyevo airport June 26, 2013. Credit: Reuters/Sergei Karpukhin


As we already discussed in my Big Data presentations, being able to analyse the data that traces all our actions and movements is a great opportunity to improve our lives, as much as to do business, but it can also be exploited for the worst.  Now Edward Snowden has put a clear case under the spotlight. Will this make us move? Will this lead to change?

It’s time to consider what ethical codes and regulations can be issued, so that the excellent opportunity technology is putting in our hands, namely sharing, measuring and extracting knowledge from all aspects of our lives, is not misused.

The Bad Side of Personalisation

As I mentioned in my previous post, there is a vast amount of data available on the Internet, and a lot of potential information. The insights we can get from it are fantastic, not only for our businesses, but also for us as consumers and users of the Internet.

Lately, when you look for something on Google, as it gets ‘wiser’ and better at guessing what you are aiming for, you are presented with the right information more quickly, and that is super, isn’t it?   Also, if you have a new hobby and you are being targeted by ads related to it, well… even though it may not be good for your wallet, I’m sure you will enjoy your newly bought gadgets.   All applications are trying to ‘personalise’ their interactions, to be more specific in order to get your attention.

As always, good things can have a downside.  When you search for a subject and the results shown to you are the ones that best match your interests, you are left out of other, more diverse information on the subject.  Yes, you may say these ‘other’ results are surely there if you scroll for them… on the third page maybe?  But who even looks at the second page of results nowadays?

So, on the one hand, the results match what we are probably looking for; on the other hand, we are not being shown the ‘other side’ of the world.  We are not being presented with other points of view.   And this is critical. I loved the TED talk by Jonathan Haidt, The moral roots of liberals and conservatives, which makes this point very clearly.  Diversity is a good thing and has to be promoted, not made harder to access.

Knowing this, the best we can do about it, for the sake of humanity, is to draw the attention of app designers so that they think of a way to balance their algorithms and avoid this perverse effect of personalisation.  Spread the word!
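To make the suggestion a bit more concrete, here is one possible, entirely hypothetical way an application could balance personalisation with diversity: blend the personalised relevance score with a small bonus for topics the user rarely sees. The items, scores and the 0.7/0.3 weighting are illustrative assumptions, not any real ranking algorithm:

```python
# Toy re-ranking: personalised relevance blended with a diversity bonus.
results = [
    {"title": "New gadget review", "topic": "gadgets", "relevance": 0.95},
    {"title": "Opposing viewpoint essay", "topic": "politics", "relevance": 0.60},
    {"title": "Another gadget unboxing", "topic": "gadgets", "relevance": 0.90},
    {"title": "Local community news", "topic": "society", "relevance": 0.55},
]
user_topic_history = {"gadgets": 40, "politics": 2, "society": 1}  # pages seen per topic

def blended_score(item, history, w_relevance=0.7, w_diversity=0.3):
    seen = history.get(item["topic"], 0)
    diversity_bonus = 1.0 / (1.0 + seen)  # topics the user rarely sees get a boost
    return w_relevance * item["relevance"] + w_diversity * diversity_bonus

for item in sorted(results, key=lambda r: blended_score(r, user_topic_history), reverse=True):
    print(round(blended_score(item, user_topic_history), 3), item["title"])
```

Even such a crude blend would occasionally surface the ‘other side’ instead of burying it on page three.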



Visualising Big Data

With all the electronic data being generated, there is a new way of studying sociology: we can now measure what is happening around a particular event in real time.  It requires a lot of computation, and the new techniques for navigating extremely large data sets have to be used, but there is also a challenge in how to present the results of all these analyses.  A 150-page report full of numbers is not easily presented to the general public.

Fortunately, there are new ways of presenting results beyond traditional diagrams: tools that allow us to visualise complex statistics or concepts.  Look at the interactive graphic made by JESS3 for The New York Times article Four Ways to Slice Obama’s 2013 Budget Proposal.

Visualisation is becoming more and more important for helping people understand data at this scale.  The proliferation of content makes it difficult to make sense of it all; visualisation puts it in a form that is digestible.  JESS3 transformed a 150-page economic report into a 6-minute animated presentation.  It was shown at a forum as a video, and the presenters could see from the posture of the audience that they were captivated, following the presentation.
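JESS3’s animation goes far beyond a simple chart, of course, but even a few lines of plotting code show the basic idea of turning a page of numbers into something the eye grasps at once. The categories and amounts below are invented for the example:

```python
import matplotlib.pyplot as plt

# Made-up budget categories and amounts, purely for illustration.
categories = ["Defense", "Health", "Education", "Energy", "Other"]
billions = [670, 940, 70, 27, 1100]

plt.figure(figsize=(6, 4))
plt.barh(categories, billions, color="steelblue")
plt.xlabel("Spending (billions of dollars, hypothetical)")
plt.title("One simple visual slice of a budget")
plt.tight_layout()
plt.show()
```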

Check out the interview with Leslie Bradshaw, co-founder of JESS3, in the Google Developers Live series: GDL Presents: Women Techmakers with JESS3.


The semantic web: jumping on a graph

I have been very busy lately, with no time to write.  But I attended some great seminars in the meantime, like Pierre De Wilde from TinkerPop talking at the GBI about the property graph model.  He explained how we can build and query information in a graph database (‘traverse a graph’).

A graph database is composed of vertices and edges. Each vertex has a name (id) and some properties; the edges also have a name and can have properties, and since they have a direction, each edge has an identified outgoing (source) vertex and an identified ingoing (target) vertex.  In this example:

v(1) is the vertex named ‘1’, and there is, for example, edge 9 with label ‘created’, which has v(1) as its outgoing vertex and v(3) as its ingoing vertex.  So the question in English:

Who worked with author 1 in any of his books?

will look like this in the traversal language Gremlin:

 g.v(1).out("created").in("created")  →  v(1), v(6), v(4)

This can be read as: take the graph, look for vertex 1 and follow all outgoing edges labelled ‘created’.  From all the vertices you have reached, follow the ingoing edges labelled ‘created’ back to their source and return the vertices you reach.  The result is 1, 6 and 4.
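For those who like to see the mechanics spelled out, here is a plain-Python sketch of the same traversal over the small example graph. The edge list is reconstructed from the example above (plus the result of the query), so treat it as an illustration rather than the actual data set:

```python
# Edges stored as (outgoing_vertex, label, ingoing_vertex) triples.
edges = [
    (1, "created", 3),  # edge 9: v(1) -created-> v(3)
    (4, "created", 3),
    (6, "created", 3),
]

def out(vertices, label):
    """Follow outgoing edges with the given label."""
    return {i for o, l, i in edges if o in vertices and l == label}

def in_(vertices, label):
    """Follow ingoing edges with the given label, back to their source."""
    return {o for o, l, i in edges if i in vertices and l == label}

# Equivalent of g.v(1).out("created").in("created")
print(in_(out({1}, "created"), "created"))  # {1, 4, 6}
```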

The DBpedia project is a great example of a graph database: it began in 2007 and now links more than 3.64 million things from Wikipedia data in a graph.  For this to work, and to be able to navigate through the metadata, the data has to be cleaned and standardised.  What do I mean? If you have a property called ‘place of birth’ and another called ‘birth place’, you know they are the same thing, and you would like both properties to be traversed when someone looks for the natives of a particular city or country.
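A minimal sketch of that kind of clean-up could look like this; the synonym table and the triples are invented examples, not DBpedia’s actual ontology mapping:

```python
# Map synonymous property names to one canonical predicate before querying.
CANONICAL = {
    "place of birth": "birthPlace",
    "birth place": "birthPlace",
    "born in": "birthPlace",
}

raw_triples = [
    ("Ada_Lovelace", "place of birth", "London"),
    ("Alan_Turing", "birth place", "London"),
]

clean_triples = [(s, CANONICAL.get(p, p), o) for s, p, o in raw_triples]

# A single predicate can now be traversed to find all natives of a city.
natives_of_london = [s for s, p, o in clean_triples if p == "birthPlace" and o == "London"]
print(natives_of_london)  # ['Ada_Lovelace', 'Alan_Turing']
```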

There are initiatives, like the ‘freeyourmetadata’ movement, that encourage you to publish your metadata in the best possible shape, so that you help build graphs that can be easily and fully traversed.  DBpedia has to deal with the data that is already in Wikipedia, so they created an ontology to tackle this quality problem, allowing you to find results even through synonyms.

I particularly liked his mention of Tim Berners-Lee’s vision of Linked Data:

 Internet = net of computers

World Wide Web = web of documents

… and the next step is

Giant Global Graph = graph of metadata    

… also called the semantic web,  open and linked data

I’m diving into Model Thinking

What is model thinking? It’s trying to find the rules behind a particular behaviour.  Creating a model forces you to name the variables that play a role, making explicit what has to be taken into account in order to predict that behaviour.  It’s great to come up with a model, but how do we know it really works?  By checking reality against the results of our rules.  So in order to validate models we need real data.
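As a tiny illustration of checking reality against the result of our rules, here is a sketch that fits a simple rule (a straight line) on part of some synthetic data and then validates it on the rest; every number in it is made up for the example:

```python
import numpy as np

# Synthetic data: a hidden rule (y = 2x + 1) plus some noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)

# Build the model on part of the data...
x_train, y_train = x[:40], y[:40]
x_test, y_test = x[40:], y[40:]
slope, intercept = np.polyfit(x_train, y_train, deg=1)

# ...and validate it against data the model has not seen.
predictions = slope * x_test + intercept
error = np.mean((predictions - y_test) ** 2)
print(f"model: y ~ {slope:.2f}x + {intercept:.2f}, test error: {error:.2f}")
```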

And everybody knows by now that there is no problem finding data on the Internet.  That means it’s a great time to create models! Admittedly, that huge amount of data doesn’t make it easy to find exactly what we are looking for, quite the contrary, even if there is a very high probability that it is out there somewhere… although search engines are getting really good at it, aren’t they?

To avoid drowning in data, we must structure it, interpret it and extract information from it. To get a better view, nothing beats aggregating, generalising, and creating models! And it’s a great time for that: there are many programs available to manipulate raw data. Even if you are not looking for something specific, it is very interesting to play with the data and see what can be found in it.  I’m particularly fond of machine learning algorithms, which allow us to find patterns we didn’t see ourselves; but they don’t explain the rules behind those patterns, we still have to discover them. Alternatively, we can construct models and look for the underlying patterns. Either way, understanding data allows us to predict new values, thus making our decisions more meaningful.
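To illustrate the ‘patterns without rules’ point, here is a small sketch with k-means clustering on synthetic data: the algorithm happily finds the groups, but it never tells us why they exist. The data and parameters are purely illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic blobs of points; in real life we would not know they are there.
rng = np.random.default_rng(1)
group_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[3, 3], scale=0.5, size=(50, 2))
points = np.vstack([group_a, group_b])

# k-means discovers the pattern, but offers no explanation for it.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print("cluster sizes:", np.bincount(labels))
```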

By modelling our world, we will end up with more knowledge than before. We will be able to create policies knowing which variables need to be influenced in order to reach a particular goal, and we will be able to measure the impact of those policies.  We will be able to steer our society towards a better future.  It makes me dream!

If you want to know more about it, check out Prof. Scott Page from the University of Michigan.  He’s amazingly clear!

Lies Checked Against Data on the Internet

I loved this article about an MIT student writing software that can highlight false claims in articles, just like spell check.

What do you think of living in a world without lies?  Where every fact could be proved true or false in seconds? Well, I’m getting a little ahead of myself.  For now, Dan Schultz, an MIT student, in partnership with PolitiFact, wants to develop an API that will allow written facts to be checked against the information gathered in the PolitiFact database.  The advances in the field of Natural Language Processing (NLP) are enabling amazing applications!

[…] Dan Schultz, a graduate student at the MIT Media Lab (and newly named Knight-Mozilla fellow for 2012), is devoting his thesis to automatic bullshit detection. Schultz is building what he calls truth goggles — not actual magical eyewear, alas, but software that flags suspicious claims in news articles and helps readers determine their truthiness. It’s possible because of a novel arrangement: Schultz struck a deal with fact-checker PolitiFact for access to its private APIs.

If you had the truth goggles installed and came across Bachmann’s debate claim, the suspicious sentence might be highlighted. You would see right away that the congresswoman’s pants were on fire. And you could explore the data to discover that Bachmann, in fact, wears some of the more flammable pants in politics.

“I’m very interested in looking at ways to trigger people’s critical abilities so they think a little bit harder about what they’re reading…before adopting it into their worldview,” Schultz told me. It’s not that the truth isn’t out there, he says — it’s that it should be easier to find. He wants to embed critical thinking into news the way we embed photos and video today: “I want to bridge the gap between the corpus of facts and the actual media consumption experience.”

Imagine the possibilities, not just for news consumers but producers. Enhanced spell check for journalists! A suspicious sentence is underlined, offering more factual alternatives. Or maybe Clippy chimes in: “It looks like you’re lying to your readers!” The software could even be extended to email clients to debunk those chain letters from your crazy uncle in Florida.

The project is using natural language processing to verify facts, via API, against the information contained in PolitiFact. That is to say that it’s not able to tell a lie from the truth on its own, but rather it does so by pulling in data on phrases that are in a system. Sometime next year, when the project is finished, Schultz plans to open-source it and then the abilities should grow. I want one for Christmas!
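Just to make the idea tangible, here is a toy sketch of matching a sentence against a small local list of fact-checked claims using text similarity. This is emphatically not Schultz’s truth goggles nor the PolitiFact API (which is private); the claims, verdicts and threshold are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A tiny, made-up stand-in for a fact-check database: claims paired with verdicts.
fact_checks = [
    ("the new law doubles the budget for schools", "False"),
    ("unemployment fell for six consecutive months", "Mostly True"),
]

article_sentence = "The minister claims the new law will double the school budget."

# Vectorise the known claims and the sentence, then look for the closest claim.
vectorizer = TfidfVectorizer().fit([claim for claim, _ in fact_checks] + [article_sentence])
claim_vectors = vectorizer.transform([claim for claim, _ in fact_checks])
sentence_vector = vectorizer.transform([article_sentence])

scores = cosine_similarity(sentence_vector, claim_vectors)[0]
best = scores.argmax()
if scores[best] > 0.3:  # arbitrary threshold for this sketch
    claim, verdict = fact_checks[best]
    print(f"Possible match ({scores[best]:.2f}): '{claim}' -> {verdict}")
```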

Zoe Kleinman reported in BBC News that the UK Government is opening its data to the public.  Tim Berners-Lee, inventor of the Web, is behind this project, a big mentality change for the UK government 🙂 :

An ambitious website that will open up government data to the public will launch in beta, or pilot, form in December.

Reams of anonymous data about schools, crime and health could all be included. The site, data.gov.uk, has been developed by Sir Tim Berners-Lee, founder of the web, and Professor Nigel Shadbolt at the University of Southampton.

It is designed to be similar to the Obama administration’s data.gov project, run by Vivek Kundra.
Mr Kundra is Chief Information Officer in the US. The American site, while not yet comprehensive, is already up and running, with improvements fuelled by user feedback.

This is good for the public and also for the UK government; there is a return on investment: data.gov.uk is built with semantic web technology, which will enable the data it offers to be drawn together into links and threads as the user searches.

Let’s enjoy our Christmas gift in December, give plenty of feedback to improve what the website offers, and encourage others to follow the movement.

Can we predict social behavior from Internet data?

The New York Times published the article Government Aims to Build a ‘Data Eye in the Sky’, reporting that a US intelligence unit has launched a research program to analyse public data and derive predictions of social and political relevance.

Now social scientists are trying to mine the vast resources of the Internet — Web searches and Twitter messages, Facebook and blog posts, the digital location trails generated by billions of cellphones — to do the same thing. [That is, to combine mathematics and psychology to predict the future, like the ‘psychohistory’ from Isaac Asimov.]

The most optimistic researchers believe that these storehouses of “big data” will for the first time reveal sociological laws of human behavior — enabling them to predict political crises, revolutions and other forms of social and economic instability, just as physicists and chemists can predict natural phenomena. […]

This summer a little-known intelligence agency began seeking ideas from academic social scientists and corporations for ways to automatically scan the Internet in 21 Latin American countries for “big data,” according to a research proposal being circulated by the agency. The three-year experiment, to begin in April, is being financed by the Intelligence Advanced Research Projects Activity, or Iarpa (pronounced eye-AR-puh), part of the office of the director of national intelligence.

The automated data collection system is to focus on patterns of communication, consumption and movement of populations. It will use publicly accessible data, including Web search queries, blog entries, Internet traffic flow, financial market indicators, traffic webcams and changes in Wikipedia entries.

Needless to say, the article also addresses the data privacy issue.  There are many comments on this news; I have extracted here an important part of Ike Solem’s first comment:

The fundamental flaw in Asimov’s notion of “predicting history” involves the mathematical concept of chaos, otherwise known as “sensitive dependence on initial conditions.”

[…] certain features of physical (and biological) systems exhibit sensitive dependence on initial conditions, such that a microscopic change in a key variable leads to a radically different outcome. While this has been heavily studied in areas like meteorology and orbital physics, it surely applies to ecology, economics, and human behavioral sciences too.

Thus, it’s a false notion that by collecting all this data on human societies, one can accurately predict future events. Some general trends might be evident, but even that is very uncertain. Just look at the gross failure of econometric models to predict economic collapses, if you want an example.

So there is always the possibility of an unforeseen agent that changes the predicted behaviour.   Still, far more trends will be uncovered from the available big data sets than human minds have discovered so far.  But what about the ‘quantum effect’?  If a trend is announced publicly, would that announcement make people follow it just because they are expected to do so?  Or, on the contrary, would it make them change their behaviour radically?  I think we are still far away from predicting human behaviour.
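As a quick numerical illustration of the ‘sensitive dependence on initial conditions’ that Ike Solem refers to, here is the classic logistic map sketch: two starting values that differ by one part in a million soon end up far apart, which is exactly why long-range behavioural prediction is so hard. The parameter value is just the usual textbook choice for chaotic behaviour:

```python
# Logistic map x -> r * x * (1 - x) in its chaotic regime (r = 3.9).
r = 3.9
x1, x2 = 0.500000, 0.500001  # almost identical starting points

for _ in range(50):
    x1 = r * x1 * (1 - x1)
    x2 = r * x2 * (1 - x2)

print(f"after 50 steps: {x1:.4f} vs {x2:.4f} (difference {abs(x1 - x2):.4f})")
```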