Elections warn about ethical issues in algorithms

I recently tweeted about this article on how Big Data was used in the last American presidential campaign.

Concordia Summit, New York 2016

“At Cambridge,” he said, “we were able to form a model to predict the personality of every single adult in the United States of America.” The hall is captivated. According to Nix, the success of Cambridge Analytica’s marketing is based on a combination of three elements: behavioral science using the OCEAN Model, Big Data analysis, and ad targeting. Ad targeting is personalized advertising, aligned as accurately as possible to the personality of an individual consumer.

Nix candidly explains how his company does this. First, Cambridge Analytica buys personal data from a range of different sources, like land registries, automotive data, shopping data, bonus cards, club memberships, what magazines you read, what churches you attend. Nix displays the logos of globally active data brokers like Acxiom and Experian—in the US, almost all personal data is for sale. […] Now Cambridge Analytica aggregates this data with the electoral rolls of the Republican party and online data and calculates a Big Five personality profile. Digital footprints suddenly become real people with fears, needs, interests, and residential addresses.
[…]

Nix shows how psychographically categorized voters can be differently addressed, based on the example of gun rights, the 2nd Amendment: “For a highly neurotic and conscientious audience the threat of a burglary—and the insurance policy of a gun.” An image on the left shows the hand of an intruder smashing a window. The right side shows a man and a child standing in a field at sunset, both holding guns, clearly shooting ducks: “Conversely, for a closed and agreeable audience. People who care about tradition, and habits, and family.”

Now I have come across another article, by Peter Diamandis, on what we can expect in four years’ time for the next election campaign.

5 Big Tech Trends That Will Make This Election Look Tame

If you think this election is insane, wait until 2020.

I want you to imagine how, in four years’ time, technologies like AI, machine learning, sensors and networks will accelerate.

Political campaigns are about to get hyper-personalized thanks to advances in a few exponential technologies.

Imagine a candidate who now knows everything about you, who can reach you wherever you happen to be looking, and who can use info scraped from social media (and intuited by machine learning algorithms) to speak directly to you and your interests.

[…] For example, imagine I’m walking down the street to my local coffee shop and a photorealistic avatar of the presidential candidate on the bus stop advertisement I pass turns to me and says:

“Hi Peter, I’m running for president. I know you have two five-year-old boys going to kindergarten at XYZ school. Do you know that my policy means that we’ll be cutting tuition in half for you? That means you’ll immediately save $10,000 if you vote for me…”

If you pause and listen, the candidate’s avatar may continue: […] “I’d really appreciate your vote. Every vote and every dollar counts. Do you mind flicking me a $1 sticker to show your support?”

I know, this last article is from Singularity Hub, and they tend to be alarmist, but given how fast technology advances, their predictions are not too exaggerated…

Either way, this reminds me how important it is to ACT on the ethical issues of algorithms. Note the capital letters: they stress that this is about taking action.  There are many issues that need to be identified, discussed, publicised and regulated, and on some of them we can already act at company level.

I spoke in May last year at the Data Innovation Summit about the biases that can be (and usually are) replicated by new algorithms based on data.  Since then I have been working on a training program to help identify and correct those biases when designing and using algorithms, and the articles above remind me that this cannot be delayed: it’s needed right now.

So if you are interested in making your people and organization aware of biases (human and digital ones), and in training them to fix these issues, contact me!

We are creating our future. Let’s not close our eyes: we can take control and assume our responsibility by setting the guardrails that will guide the path to our future society.

 

AI and Machine Learning in business: use it everywhere!

How One Clothing Company Blends AI and Human Expertise, HBR nov-16

Last week Bev, from PWI’s group on LinkedIn, pointed me to a great HBR article: “How One Clothing Company Blends AI and Human Expertise”, by H. James Wilson, Paul Daugherty and Prashant Shukla.

It describes how the company Stitch Fix works, using machine learning insights to assist their designers, and as you will see, they use machine learning at many levels throughout the company.

The company offers a subscription clothing and styling service that delivers apparel to its customers’ doors. But users of the service don’t actually shop for clothes; in fact, Stitch Fix doesn’t even have an online store. Instead, customers fill out style surveys, provide measurements, offer up Pinterest boards, and send in personal notes. Machine learning algorithms digest all of this eclectic and unstructured information. An interface communicates the algorithms’ results along with more-nuanced data, such as the personal notes, to the company’s fashion stylists, who then select five items from a variety of brands to send to the customer. Customers keep what they like and return anything that doesn’t suit them.

The key success factor for the company is to be good at recommending clothes that not only fit the customers, but that they like enough to keep and, better still, enough to be happy with their subscription.

Stitch Fix, which lives and dies by the quality of its suggestions, has no choice but to do better [than Amazon and Netflix].

Unlike Amazon and Netflix, which recommend products directly to customers, Stitch Fix uses machine learning to provide digested information to its human stylists and designers.

[…] companies can use machines to supercharge the productivity and effectiveness of workers in unprecedented ways […]

For example, algorithms analyse measurements to find other clients with the same body shape, so the company can reuse the knowledge of which items fitted those clients: the clothes they kept. Algorithms are also used to extract information about clients’ taste in styles from their brand preferences and their comments on collections.  Human stylists, using the results of that data analysis and reading the client’s notes, are better equipped to choose clothes that will suit the customer.
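To make the "clients with the same body shape" idea concrete, here is a minimal sketch of a nearest-neighbour lookup. Everything in it is my own assumption for illustration: the client records, the three measurements and the function names are hypothetical, and the article does not say which method Stitch Fix actually uses.

```python
import math
from collections import Counter

# Hypothetical client records: measurements in cm (bust, waist, hip)
# and the items each client kept from earlier deliveries.
CLIENTS = {
    "c1": {"measurements": (88, 70, 96), "kept": ["wrap dress", "slim jeans"]},
    "c2": {"measurements": (90, 72, 97), "kept": ["slim jeans", "blazer"]},
    "c3": {"measurements": (110, 100, 112), "kept": ["tunic", "wide trousers"]},
    "c4": {"measurements": (89, 71, 95), "kept": ["slim jeans", "wrap dress"]},
}

def nearest_clients(measurements, clients, k=2):
    """Return the ids of the k clients whose measurements are closest."""
    return sorted(
        clients,
        key=lambda cid: math.dist(measurements, clients[cid]["measurements"]),
    )[:k]

def suggest_items(measurements, clients=CLIENTS, k=3):
    """Rank the items kept by the k most similar clients, most frequent first."""
    counts = Counter()
    for cid in nearest_clients(measurements, clients, k):
        counts.update(clients[cid]["kept"])
    return [item for item, _ in counts.most_common()]

# Candidate items for a new client, for a human stylist to review.
print(suggest_items((89, 70, 96)))
```

The output is only a shortlist: in the set-up the article describes, the stylist still makes the final choice.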

Next, it’s time to pick the actual [item of clothing] to be shipped. This is up to the stylist, who takes into account a client’s notes or the occasion for which the client is shopping. In addition, the stylist can include a personal note with the shipment, fostering a relationship, which Stitch Fix hopes will encourage even more useful feedback.

This human-in-the-loop recommendation system uses multiple information streams to help it improve.

See how stylists maintain a human dialogue with their clients through the included note. Customers usually appreciate this personalised contact, and it benefits the company because it opens the door to feedback that helps tailor the next delivery.

The company is testing natural language processing for reading and categorizing notes from clients — whether it received positive or negative feedback, for instance, or whether a client wants a new outfit for a baby shower or for an important business meeting. Stylists help to identify and summarize textual information from clients and catch mistakes in categorization.

The machine learning systems are ‘learning through experience’ (that is, adapting to feedback) as usual, but in a humanly ‘supervised’ way. This supervision allows the company to try new algorithms without the risk of losing clients if the results are not as good as expected.

Stitch Fix employs more than 2,800 stylists, dispersed across the country, all of them working from home and setting their own hours. In this distributed workforce, stylists are measured by a variety of metrics, including the amount of money a client spends, client satisfaction, and the number of items a client keeps per delivery. But one of the most important factors is the rate at which a stylist puts together a collection of clothes for a client.

Speed is an important factor to satisfy their customers’ demands, and machine learning gives them the needed insight so much quicker than if stylists had to go through all the raw data!

This is where the work interface comes into effect. To enable fast decision making, the screen on which a stylist views recommendations shows the relevant information the company keeps about a client, including apparel and feedback history, measurements, and tolerance for fashion risks — it’s all readily accessible

The interface itself, which shows the information to the stylist, is also adapting through feedback and being tested for better performance.  And you can go one step further and check the stylists for bias:

Stitch Fix’s system can vary the information a stylist sees to test for bias. For instance, how might a picture of a client affect a stylist’s choices? Or knowledge about a client’s age? Does it help or hinder to know where a client lives?

By measuring the impact of modified information in the stylist interface, the company is developing a systematic way to measure improvements in human judgment

And there are many other machine learning algorithms throughout the company:

[…]the company has hundreds of algorithms, like a styling algorithm that matches products to clients; an algorithm that matches stylists with clients; an algorithm that calculates how happy a customer is with the service; and one that figures out how much and what kind of inventory the company should buy.

The company is also using information about kept and returned items to find fashion trends:

From this seemingly simple data, the team has been able to uncover which trends change with the seasons and which fashions are going out of style.

The data they are collecting is also helping advance research on computer vision systems:

[…] system that can interpret style and extract a kind of style measurement from images of clothes. The system itself would undergo unsupervised learning, taking in a huge number of images and then extracting patterns or features and deciding what kinds of styles are similar to each other. This “auto-styler” could be used to automatically sort inventory and improve selections for customers.

In addition to developing an algorithmic trend-spotter and an auto-styler, Stitch Fix is developing brand new styles — fashions born entirely from data. The company calls them frankenstyles. These new styles are created from a “genetic algorithm,” modeled after the process of natural selection in biological evolution. The company’s genetic algorithm starts with existing styles that are randomly modified over the course of many simulated “generations.” Over time, a sleeve style from one garment and a color or pattern from another, for instance, “evolve” into a whole new shirt.
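A toy sketch may help picture how such a genetic algorithm works. It is only an illustration under my own assumptions: the attribute lists, the `KEEP_SCORE` table used as a selection criterion and all function names are hypothetical, since the article does not describe how Stitch Fix scores or encodes its styles.

```python
import random

# A style is a tuple (sleeve, color, pattern). All values are invented.
SLEEVES = ["short", "long", "puff", "sleeveless"]
COLORS = ["navy", "red", "white", "olive"]
PATTERNS = ["plain", "striped", "floral", "dotted"]
OPTIONS = (SLEEVES, COLORS, PATTERNS)

# Hypothetical stand-in for "how often clients kept items with this attribute".
KEEP_SCORE = {"long": 3, "puff": 2, "navy": 3, "olive": 1, "floral": 3, "striped": 2}

def fitness(style):
    """Score a style by summing the keep scores of its attributes."""
    return sum(KEEP_SCORE.get(attr, 0) for attr in style)

def crossover(a, b, rng):
    """Build a child that takes each attribute from one of the two parents."""
    return tuple(rng.choice(pair) for pair in zip(a, b))

def mutate(style, rng, rate=0.1):
    """Randomly replace each attribute with probability `rate`."""
    return tuple(
        rng.choice(opts) if rng.random() < rate else attr
        for attr, opts in zip(style, OPTIONS)
    )

def evolve(population, generations=30, seed=0):
    """Evolve the population and return the fittest style found."""
    rng = random.Random(seed)
    size = len(population)
    for _ in range(generations):
        # Selection: the fitter half survives unchanged and become parents.
        parents = sorted(population, key=fitness, reverse=True)[: max(2, size // 2)]
        children = []
        while len(parents) + len(children) < size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = parents + children
    return max(population, key=fitness)

START = [
    ("short", "navy", "plain"),
    ("long", "red", "striped"),
    ("puff", "white", "floral"),
    ("sleeveless", "olive", "dotted"),
]
print(evolve(START))
```

Because the best parents are carried over unchanged, the best score can never get worse from one generation to the next; the interesting part is that the winner may combine a sleeve from one starting garment with a color or pattern from another.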

What does a company using so many machine learning systems look like at employee level? How is it perceived by the employees? This is what they say:

Even with the constant monitoring and algorithms that guide decision making, according to internal surveys, Stitch Fix stylists are mostly satisfied with the work. And this type of work, built around augmented creativity and flexible schedules, will play an important role in the workforce of the future.

Machine learning and AI (artificial intelligence) systems are changing the way companies do business.  They provide insight that either could not be grasped before, or could be, but not at this speed and not as a tool available to assist each and every employee.

The least that can be said is that this will improve productivity in all sectors. Just as almost everyone today uses the Internet to check a word, look up a translation or a recipe, check the weather and countless other things, the new generation of employees will be assisted by tons of algorithms that analyse data and deduce, induce or summarize information to help them in their work and decision-making.

Sexism spotted with Maths!

I gave a talk in May this year called ‘Restore the balance of data’ at the Data Innovation Summit.  It was about sexism and other biases that are implicit in our existing electronic traces (current and historical data), and my concern that we are using that data as the baseline to create new prediction algorithms.

I discussed this many times at home while preparing the talk.  I had lively discussions with my husband and lovely sons over our family Sunday lunches. So it didn’t surprise me that my eldest son, Alex, thought of me when reading this article in the MIT Technology Review about sexism in our language.

The article is about a dataset of texts that researchers are using to “better understand everything from machine translation to intelligent Web searching.”  They are transforming words in the text into vectors, and then applying mathematical properties to derive meaning:

It turned out that words with similar meanings occupied similar parts of this vector space. And the relationships between words could be captured by simple vector algebra. For example, “man is to king as woman is to queen” or, using the common notation, “man : king :: woman : queen.” Other relationships quickly emerged too such as  “sister : woman :: brother : man,” and so on. These relationships are known as word embeddings.
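To see what "simple vector algebra" means here, consider this minimal sketch. The two-dimensional vectors are invented by me for illustration (real embeddings have hundreds of dimensions learned from text), and the function names are hypothetical. The analogy "a : b :: c : x" is solved by looking for the word closest to b - a + c.

```python
import math

# Toy 2-D "embeddings", invented for illustration:
# axis 0 loosely stands for gender, axis 1 for royalty.
VECTORS = {
    "man": (1.0, 0.0),
    "woman": (-1.0, 0.0),
    "king": (1.0, 1.0),
    "queen": (-1.0, 1.0),
    "apple": (0.0, -1.0),
}

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def analogy(a, b, c, vectors=VECTORS):
    """Solve a : b :: c : x by finding the word closest to b - a + c."""
    target = tuple(
        vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])
    )
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("man", "king", "woman"))  # prints: queen
```

The same arithmetic applied to vectors learned from biased text is what produces the sexist answers discussed next: the algebra only reflects the geometry it is given.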

The article is about a problem that researchers have identified in this dataset. As they put it: “it is blatantly sexist.”  Here are some examples they provide:

But ask the database “father : doctor :: mother : x” and it will say x = nurse. And the query “man : computer programmer :: woman : x” gives x = homemaker.

Thinking about it, isn’t it obvious that if we have biases in our behavior, the writings about our world will be biased too?  And anything derived from our biased written traces will reflect our views, with all our biases too.

So we learned to extrapolate from our old behavior to predict our future behavior… just to discover that we don’t like what we are getting out of it!  Our old behavior, amplified by the algorithm, doesn’t seem so good, does it? It’s clearer than ever that we don’t want to continue behaving like that in the future… Well, that’s a positive point: it’s good that this uncovers our blind spots, isn’t it?

Now the good news: it can be fixed!

The Boston team has a solution. Since a vector space is a mathematical object, it can be manipulated with standard mathematical tools.

The solution is obvious. Sexism can be thought of as a kind of warping of this vector space. Indeed, the gender bias itself is a property that the team can search for in the vector space. So fixing it is just a question of applying the opposite warp in a way that preserves the overall structure of the space.

Oh, seems so easy…for mathematicians anyway 😉  But no, even for mathematicians it is difficult to find and to measure the distortions:

That’s the theory. In practice, the tricky part is measuring the nature of this warping. The team does this by searching the vector space for word pairs that produce a similar vector to “she: he.” This reveals a huge list of gender analogies. For example, she;he::midwife:doctor; sewing:carpentry; registered_nurse:physician; whore:coward; hairdresser:barber; nude:shirtless; boobs:ass; giggling:grinning; nanny:chauffeur, and so on.

Having compiled a comprehensive list of gender biased pairs, the team used this data to work out how it is reflected in the shape of the vector space and how the space can be transformed to remove this warping. They call this process  “hard de-biasing.”
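As a rough picture of what "removing the warping" can mean, here is a simplified sketch: estimate a gender direction from a she/he pair, then subtract each word's component along that direction. This is my own illustration with invented three-dimensional vectors and hypothetical function names; it shows only this one projection step and is not the team's actual procedure, which works on many word pairs and includes further steps.

```python
import math

# Toy 3-D vectors invented for illustration; axis 0 plays the role of "gender".
TOY = {
    "he": (1.0, 0.0, 0.0),
    "she": (-1.0, 0.0, 0.0),
    "doctor": (0.4, 0.9, 0.1),
    "nurse": (-0.5, 0.8, 0.2),
}

def dot(u, v):
    """Dot product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def gender_direction(she, he):
    """Unit vector pointing from 'he' to 'she'."""
    diff = tuple(s - h for s, h in zip(she, he))
    length = math.sqrt(dot(diff, diff))
    return tuple(d / length for d in diff)

def neutralise(vec, direction):
    """Remove the component of vec that lies along the given direction."""
    scale = dot(vec, direction)
    return tuple(x - scale * d for x, d in zip(vec, direction))

g = gender_direction(TOY["she"], TOY["he"])
for word in ("doctor", "nurse"):
    before = dot(TOY[word], g)
    after = dot(neutralise(TOY[word], g), g)
    print(f"{word}: gender component {before:.2f} -> {abs(after):.2f}")
```

After the projection, "doctor" and "nurse" keep their other coordinates but no longer lean towards "he" or "she", which is the sense in which the rest of the space is preserved.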

Finally, they use the transformed vector space to produce a new list of gender analogies[…]

Read the full article if you are interested in their de-biasing process.  Their conclusion, with which I completely agree, is:

“One perspective on bias in word embeddings is that it merely reflects bias in society, and therefore one should attempt to debias society rather than word embeddings,” say Bolukbasi and co. “However, by reducing the bias in today’s computer systems (or at least not amplifying the bias), which is increasingly reliant on word embeddings, in a small way debiased word embeddings can hopefully contribute to reducing gender bias in society.”

That seems a worthy goal. As the Boston team concludes: “At the very least, machine learning should not be used to inadvertently amplify these biases.”

 

DIS2016 Restore the balance of data

The Data Innovation Summit 2016 took place two weeks ago.  I was due to speak using the ‘ignite’ presentation format.  For those who don’t know this format, it’s a nightmare! Joking aside, it means that the slides advance automatically at regular intervals (15 seconds in my case).  You cannot stop them and you don’t control the flow… so to stay synchronized, you really have to prepare your speech in advance: you must know exactly how long it takes to explain each of your points and which examples you’ll present (check it out, 15 seconds go very quickly when you’re searching for your words :-))).

So here it is, my 5-minute presentation, if you only count the time on stage…

Big Data workshop at the First European Celebration of Women in Computing

This last Tuesday, I led the ‘Discover Big Data’ workshop at the First European Celebration of Women in Computing.  There were many parallel sessions that morning, and I received some questions about my presentation from participants who couldn’t split themselves in two to attend this workshop 😉

Welcome to the Big Data workshop, we need women in Big Data!

This workshop is called ‘Discover Big Data’ because Big Data is a hyped term. It is used for anything where data is involved, but it remains confusing as to what it means.

  • You are in Big Data if you are dealing with data that has to be processed at great velocity, as is the case for GPS or for mobile phones.
  • You are in Big Data if you combine information that comes in a variety of formats, like your customers’ transactions and your customers’ emails, or if you go to the social networks, like Facebook or Twitter.  You can discover which topics are being discussed, what is being said about your company or who is talking about your product.
  • You are in Big Data if you’re exploiting one of the many big available datasets, like weather information, official administration records such as property records, financial information, economic indicators…

What can be done with Big Data?

It is mainly used for customer intimacy: discovering your customer profiles and targeting them on a one-to-one basis, finding their preferences and the hidden patterns that predict customer churn.

It can be used for optimisation, finding patterns of systematic problems hidden in your historical data. It can help organise your maintenance, improve the supply chain, find better logistics solutions and optimise processes.

It is also used for innovation: It can help you create your new product. Looking at your competitors and finding the white-spaces or uncovering market trends.

And more generally, with all the available data you can create models forecasting future events and behaviors. By using what-if analysis to predict the outcomes of potential changes, you can direct your business strategy. It helps you anticipate previously unforeseen opportunities and avoid costly situations, find new revenue opportunities or identify more effective business models.

As you see, there are great business opportunities!

How can we do all that?

There are many techniques like statistical analysis, data mining, text analysis, sentiment analysis, graph analysis, machine learning, predictive analysis, neural networks, conceptual clustering…

You may already have heard some of those words, which sound promising but also very complicated. Even so, the Big Data field is growing exponentially, and men are rushing into it.  Only 10% of the people in it are women; don’t you want to be part of it? Companies that caught this wave are thriving, well ahead of classical businesses. They are offering you the right product at the right time, with the features you are looking for, at the price you are willing to pay. They are increasing their profits while shaping our future with the products and business strategies they are creating.

I hear you saying: This is great but I don’t know a thing about this and it sounds so complicated. I’m here to tell you that not all of it is that difficult.

YOU could be in Big Data.

If you are in computing you have a leg up. And if you like mathematics you’ll enjoy being a data scientist. But you could be in Big Data even if you are not a techy person.  If you are in HR, in marketing, if you are a manager or a decision-maker with the right mindset open to data, you can exploit the Big Data wave.

Even if you see the potential, women tend to think ‘it’s not for me, I don’t have the competencies’.

Let me use some feminine stereotypes to illustrate we have the basic skills:

  • We have a tradition of getting together and talking too much.  And we have a tendency to be matchmakers.  We can put those skills of information gathering and making connections to good use finding relationships between data.
  • Who recognises herself in this? We are control freaks and plan everything, even the time of our loved ones.  Don’t you have a TODO list for your partner on Saturdays?  I do: Love, since you are driving Alex to the scouts, could you please pass by and drop the trousers at the dry cleaner?  What if you knew what your GPS knows already, that a road is blocked?  You could have asked him to bring some bread back as he’s going to pass near the bakery.  Don’t you feel satisfaction when doing things efficiently, optimising the Saturday time? So imagine tapping into all the available information and using it to improve the processes, it’s a rewarding job.
  • And if you have artistic skills, visualisation is your field. This is a new branch of data science, which is creating very interesting new techniques to show more than 3 dimensions of data, so you can easily see relationships graphically.
  • Generally speaking, I think we women have a natural talent to be data analysts: the ‘What if’ comes naturally to us, and we always investigate all possibilities before deciding on one, don’t we?

Summarising, we saw there is business in here, and that we have the basic skills to be in the Data business.

Moreover, it is important that more women move into this field, not only because of the many business opportunities, but also because there are ethical issues involved in Big Data. We can mention data privacy and price gouging as some of these issues, but there are other business models that can be controversial.

The rules about what can be done with the data and what is off-limits are being defined right now.  Let’s not miss the opportunity to give our view on this.

As an example, there is a great initiative from the Data2X program of the UN, which is researching women’s freedom of movement through satellite images and phone geolocation.  Are women limited in their movements in some countries? Do they have access to education, to health care? Great initiative, but what about the same at a private level: is following the movements of your partner through her/his phone geolocation ethical? What about tracking the movements of your children, as is already done in some countries?

It’s important to have our say in the ethical uses of all those lakes of data and to be represented in the decisions that will define our future society. We women have a natural tendency to look after our loved ones, taking their needs into consideration. That’s what Big Data needs: people who set the rules for using the incredible amounts of data, taking into account the different perspectives and with a long-term view in mind. Now is the moment to use our feminine voice to shape a better society for all of us, participating also in the creation of the new business models.

In this workshop you will hear success stories to show you the opportunities to be included in this field. I hope you’ll join the Big Data movement.

Pre-Crime unit for tracking Terrorists?

After the recent events in Belgium, the terrorist bomb attacks in Zaventem and Brussels, I couldn’t help remembering the Bloomberg Businessweek article about pre-crime: ‘China Tries Its Hand at Pre-Crime’.  It refers to the film Minority Report, with Tom Cruise, which takes place in a future society where three mutants foresee all crime before it occurs. Plugged into a great machine, these “precogs” are the basis of a police unit (the Pre-Crime unit) that arrests murderers before they commit their crimes.

The company China Electronics Technology recently won the contract to build the ‘united information environment’, as they call it: an ‘antiterrorism’ platform, as declared by the Chinese government:

The Communist Party has directed [them] to develop software to collate data on jobs, hobbies, consumption habits, and other behavior of ordinary citizens to predict terrorist acts before they occur.

This may seem a little too much to ask: if you think about it, you may need every daily detail to be able to predict terrorist behaviour. But in a country like China, where the state has had control over its citizens for many decades, where there are no privacy limits to respect and there is a good network of informants…

A draft cybersecurity law unveiled in July grants the government almost unbridled access to user data in the name of national security. “If neither legal restrictions nor unfettered political debate about Big Brother surveillance is a factor for a regime, then there are many different sorts of data that could be collated and cross-referenced to help identify possible terrorists or subversives,” says Paul Pillar, a nonresident fellow at the Brookings Institution.

See how there is now also a new target: subversives.  The article continues:

China was a surveillance state long before Edward Snowden clued Americans in to the extent of domestic spying. Since the Mao era, the government has kept a secret file, called a dang’an, on almost everyone. Dang’an contain school reports, health records, work permits, personality assessments, and other information that might be considered confidential and private in other countries. The contents of the dang’an can determine whether a citizen is eligible for a promotion or can secure a coveted urban residency permit. The government revealed last year that it was also building a nationwide database that would score citizens on their trustworthiness.

Wait a second, who’s defining what ‘trustworthiness’ is, and what happens if you’re deemed not trustworthy?

New antiterror laws that went into effect on Jan. 1 allow authorities to gain access to bank accounts, telecommunications, and a national network of surveillance cameras called Skynet. Companies including Baidu, China’s leading search engine; Tencent, operator of the popular social messaging app WeChat; and Sina, which controls the Weibo microblogging site, already cooperate with official requests for information, according to a report from the U.S. Congressional Research Service. A Baidu spokesman says the company wasn’t involved in the new antiterror initiative.

So Skynet is here now (remember Terminator Genisys?). Even if, right after a horrendous crime, you can be tempted to be happy that this ‘pre-crime’ initiative is being built, there are still far too many negative aspects to consider before having such a tool: in whose hands it will be, who defines what a crime is, and what about your free will to change your mind, to mention a few.  Let’s begin thinking about how to tackle them.

Great visualisation tips

I would like to share with you this article in the Harvard Business Review.  It gives excellent advice on how to ‘make extreme numbers resonate’, with 3 examples to illustrate the tips:

  1. Challenge: Green Mountain sold 18 billion coffee pods in two years. How can you give people a concrete sense of just how many objects that is?
    [Image: HBR visualisation of huge numbers (coffee cups)]

  2. Challenge: Only three in 10,000 high school basketball players ever make it to the NBA. How can you give someone a deep understanding of the rarity of that feat?
    [Image: HBR visualisation of small numbers (basketball)]

  3. Challenge: Every year tens of thousands of people leave one U.S. city for another. How can you show changes on this scale when it’s so hard to keep track of complex movement? […]
    [Image: HBR visualisation of complexity (movement)]

In the first example, they give tips to visualise huge numbers; the second one is for small numbers; but the third one is really interesting, as it shows an extremely clear way to picture complexity.

 

 

The Value of Emotional Connection

Scott Magids, Alan Zorfas and Daniel Leemon tell us that research on motivational values is paying off:

Our research across hundreds of brands in dozens of categories shows that it’s possible to rigorously measure and strategically target the feelings that drive customers’ behavior. We call them “emotional motivators.” They provide a better gauge of customers’ future value to a firm than any other metric, including brand awareness and customer satisfaction, and can be an important new source of growth and profitability.

The article guides you through a detailed process to find out your customers’ motivators, which begins with:

Online surveys can help you quantify the relevance of individual motivators. Are your customers more driven by life in the moment or by future goals? Do they place greater value on social acceptance or on individuality? Don’t assume you know what motivates customers just because you know who they are. Young parents may be motivated by a desire to provide security for their families—or by an urge to escape and have some fun (you will probably find both types in your customer base). And don’t undermine your understanding of customers’ emotions by focusing on how people feel about your brand or how they say it makes them feel. You need to understand their underlying motivations separate from your brand.

See the full Harvard Business Review article for the complete description. What is surprising is this finding:

To increase revenue and market share, many companies focus on turning dissatisfied customers into satisfied ones. However, our analysis shows that moving customers from highly satisfied to fully connected can have three times the return of moving them from unconnected to highly satisfied. And the highest returns we’ve seen have come from focusing on customers who are already fully connected to the category—from maximizing their value and attracting more of them to your brand.

It is analogous to the different strategies used in education:

  • In secondary school you have to reach a minimum level of knowledge in every course you take. Students often have to concentrate on the subjects for which they have no natural talent.
  • In higher studies, it pays to focus on your strengths, on your best skills, and to improve them until you are really good at them.

Youngsters are rarely very motivated by courses they don’t really like, even if they manage them well enough to pass the year. It is no surprise that the second group is easier to motivate, and so it seems reasonable that the knowledge or skill acquired will be more impressive in the second group than in the first. What is surprising is that we lacked this intuition and needed research to show it with data.

How to lie with charts

I hope you didn’t miss the article on visualization from the Harvard Business Review. It is called ‘Vision statement: How to lie with charts’, and it’s full of clearly stated examples.

http://en.wikipedia.org/wiki/United_States_presidential_election,_2008

Source: Wikipedia

This color-coded map is one of the examples they show, where coloring each county with the color of the party that won the majority there is misleading. The map represents the 2008 election (Obama versus McCain): about 80% of the US is colored in red (the Republican color), yet the Republican candidate John McCain received only about 46% of the popular vote. The mismatch between the outcome you would (naturally) expect after looking at this map and the real result comes from putting on a map information that is not related to geography. The number of votes in a county or a state is not proportional to its geographical size.
[..] you could call it the New York City problem: 0.01% of the area but 2.7% of the population.

The suggested better representation uses bubbles whose sizes are proportional to the number of votes. The resulting map shows, more correctly, a majority of blue instead.

Source: hbr.org
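
To make the point concrete, here is a small Python sketch with invented figures (three hypothetical regions, not real election data). It compares the share of the map each color occupies with its share of the votes, and sizes each bubble so that its area, not its radius, is proportional to the vote count. The region names, the numbers and the `max_radius` parameter are all assumptions made for illustration.

```python
import math

# Hypothetical regions: (name, land area in km2, votes cast, winning color).
# These numbers are invented for illustration; they are not election results.
regions = [
    ("Rural A", 90_000, 100_000, "red"),
    ("Rural B", 60_000, 80_000, "red"),
    ("Big City", 1_000, 400_000, "blue"),
]

def share_by_color(regions, index):
    """Share of the total held by each color for one column (1 = area, 2 = votes)."""
    total = sum(region[index] for region in regions)
    shares = {}
    for region in regions:
        color = region[3]
        shares[color] = shares.get(color, 0.0) + region[index] / total
    return shares

def bubble_radius(votes, max_votes, max_radius=30.0):
    """Radius that makes the bubble's AREA proportional to the vote count."""
    return max_radius * math.sqrt(votes / max_votes)

area_share = share_by_color(regions, 1)   # what a colored map shows
vote_share = share_by_color(regions, 2)   # what actually decides the election

print("Share of the map:", area_share)
print("Share of the votes:", vote_share)

max_votes = max(region[2] for region in regions)
for name, _, votes, color in regions:
    print(name, color, "bubble radius:", round(bubble_radius(votes, max_votes), 1))
```

In this toy example red covers almost the whole map while blue holds most of the votes. Scaling the radius with the square root of the votes matters: if the radius itself were proportional to the votes, big regions would look even bigger than they are, and the chart would lie in a different way.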

Visualization is growing in importance now that we have so much data all around us. It can help to identify trends, to find patterns and to show relations between data. It can show what the data represents in an intuitive way.

But as this article shows, when used the wrong way, visualization can just as easily mislead you.

To be on the safe side, it’s better to check the numbers or data behind the representation, to confirm what the image is showing you … and that nobody is tricking you!

Correlation and Causation in Big Data

Big data began as a term for extremely large data sets. These data sets cannot be managed or analyzed with conventional database programs, not only because their size exceeds the capacity of standard data management tools, but also because of the variety and unstructured nature of the data (it comes from different sources such as the sales department, the customer contact center, social media, mobile devices and so on) and because of the velocity at which it moves. Imagine what it takes for a GPS to continually recalculate the next move to stay on the best route and avoid traffic jams: it has to look at all the traffic information coming from official sources as well as from other drivers in real time, and transmit the details before the car reaches the next crossroads.

The term ‘Big Data’ is also used for the new technology needed to process the data and reveal patterns, trends, and associations. Furthermore, the term is now synonymous with big data’s analytical power and its business potential, which will help companies and organizations improve operations and make faster, more intelligent decisions.

What is big data used for?

The first and most obvious use is statistics: how many chocolates have we sold? What are the global sales, split by country? Where do the customers come from?
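
As a minimal sketch of this kind of question, assuming a handful of made-up sales records (the countries, customer origins and quantities are hypothetical), the totals can be computed like this:

```python
from collections import defaultdict

# Hypothetical sales records: (country, customer origin, boxes sold).
sales = [
    ("Belgium", "local", 120),
    ("Belgium", "tourist", 80),
    ("Spain", "local", 50),
    ("Japan", "online", 70),
]

def total_by(records, key_index):
    """Sum the boxes sold, grouped by the chosen column (0 = country, 1 = origin)."""
    totals = defaultdict(int)
    for record in records:
        totals[record[key_index]] += record[2]
    return dict(totals)

total_sold = sum(boxes for _, _, boxes in sales)  # how many chocolates have we sold?
by_country = total_by(sales, 0)                   # sales split by country
by_origin = total_by(sales, 1)                    # where do the customers come from?

print(total_sold, by_country, by_origin)
```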

Then correlation comes into play: things that follow the same tendency, that go together or move together. For example, countries with strong chocolate sales also have a lot of PhDs.

Thanks to http://tylervigen.com

Correlation is not causation. Eating chocolate does not make you a PhD, and having a PhD does not make you more likely to love chocolate. Analyzing correlations is still a big deal. A correlation can be a mere conjunction, as with thunder and lightning. It can also reflect a causal relationship, but even when there is causality it is hard to tell its direction: which is the cause and which is the effect. Nevertheless, predictive behaviour analysis based on big data is doing a great job, even when the ‘why’ behind it, the underlying causes, remains hidden and unexplained.
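
Here is a small Python sketch of how two quantities can be strongly correlated without one causing the other. It uses synthetic data, and the hidden common factor (called `wealth` here) is purely an assumption chosen for illustration, not a finding about real chocolate sales or real PhDs. Both series are built from that hidden factor plus independent noise, so they move together even though neither influences the other.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical hidden factor driving both quantities (an assumption).
wealth = [rng.uniform(0, 10) for _ in range(200)]
chocolate = [w + rng.gauss(0, 1) for w in wealth]  # depends on wealth only
phds = [w + rng.gauss(0, 1) for w in wealth]       # depends on wealth only

r_raw = pearson(chocolate, phds)

# Subtract the hidden factor. We can do this exactly only because we built
# the data ourselves; with real data the factor is usually unknown.
chocolate_rest = [c - w for c, w in zip(chocolate, wealth)]
phds_rest = [p - w for p, w in zip(phds, wealth)]
r_rest = pearson(chocolate_rest, phds_rest)

print("correlation of the raw series:", round(r_raw, 2))
print("correlation once the hidden factor is removed:", round(r_rest, 2))
```

The raw series come out strongly correlated, while the leftovers after removing the common factor are close to uncorrelated. The numbers alone cannot tell you that a third factor is at work; that explanation has to come from a model of what is going on.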

The great potential of big data is that it helps us discover correlations, patterns and trends where we couldn’t see them before, but it’s up to us to create the theories and models that explain the relations behind those correlations.