The Art of Storytelling in Data Science and How to Create Data Stories
The idea of storytelling is fascinating: you take an idea or an incident and turn it into a story. This brings the idea to life and makes it more interesting. It happens in our day-to-day lives. Whether we narrate a funny incident or present our findings, stories have always been the “go-to” for drawing interest from listeners and readers alike.
For instance, when we talk about how one of our friends got scolded by a teacher, we tend to narrate the incident from the beginning so that the flow is maintained.
Let’s take an example of the most common driving distractions by gender. There are two ways to tell this.
The first is to give you some statistics:
- 6% of men believe texting is a distraction, compared to 4.2% of women.
- Kids in the car distract 9.8% of men, compared to 26.3% of women.
The second is to present similar statistics as a visual, such as this one from kids4kars.org.
Which one do you think tells a better story?
Table of Contents
- The Need for Storytelling
- How to create stories?
- Begin with a pen-paper approach
- Dig deeper to identify the sole purpose of your story
- Use powerful headings
- Design a Road-Map
- Conclude with brevity
- Types of Data and Suitable Charts
- Text [Wordclouds]
- Mixed [Facet Grids]
- Numeric [Line Charts/Bar Charts]
- Stocks [Candlestick Charts]
- Geographic [Maps]
- Stories During the Steps of Predictive Modeling
- Data Exploration
- Feature Visualizing
- Model Creation
- Model Comparisons
- Best Practices of Storytelling
- End Notes
The need for storytelling
The art of storytelling is simple and complex at the same time. Stories provoke thought and bring out insights that could not have been understood or explained before. Storytelling is often overlooked in data-driven operations because we believe it is a trivial task.
What we fail to understand is that even the best stories, when presented poorly, end up being useless!
In several firms, the first step towards analyzing anything is storyboarding it, with questions such as: why do we have to analyze this? What decisions can we make from it? Sometimes the data alone tells such a visual and intricate story that we don’t need to run complex correlations to confirm it.
The best example of why we need stories and visuals to explain data is Anscombe’s Quartet: a set of four datasets with very similar summary statistics that look completely different when you visualize them.
These are the four datasets that make up Anscombe’s Quartet. If we look at the numbers alone, we find that their summary statistics are almost identical.
Let’s see how they appear when we visualize them.
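The quartet is small enough to check for yourself. Here is a minimal sketch that uses Anscombe’s published values and only Python’s standard library to compute the mean, variance and Pearson correlation of each dataset. The names `QUARTET`, `summarize` and `plot_quartet` are my own for illustration, and the optional plotting helper assumes matplotlib is installed.

```python
from statistics import mean, variance

# Anscombe's published values; datasets I-III share the same x column.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def summarize(xs, ys):
    """Return the summary statistics that make the four datasets look alike."""
    return {
        "mean_x": round(mean(xs), 2),
        "mean_y": round(mean(ys), 2),
        "var_x": round(variance(xs), 2),
        "var_y": round(variance(ys), 2),
        "r": round(pearson_r(xs, ys), 2),
    }


def plot_quartet(path="anscombe.png"):
    """Scatter each dataset in its own panel (requires matplotlib)."""
    import matplotlib.pyplot as plt  # lazy import: the summary works without it

    fig, axes = plt.subplots(2, 2, sharex=True, sharey=True, figsize=(8, 6))
    for ax, (name, (xs, ys)) in zip(axes.flat, QUARTET.items()):
        ax.scatter(xs, ys)
        ax.set_title(f"Dataset {name}")
        ax.set_xlabel("x")
        ax.set_ylabel("y")
    fig.tight_layout()
    fig.savefig(path)


if __name__ == "__main__":
    for name, (xs, ys) in QUARTET.items():
        print(name, summarize(xs, ys))
```

The printed summaries barely differ from one dataset to the next, while the four scatter panels from `plot_quartet` look nothing alike. That gap is the whole argument for visualizing before trusting a summary.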
How to create stories?
Creating a story or a plot is the first step to selling your ideas and putting your best foot forward. Most people fail to think their stories through and cannot differentiate themselves from mediocrity. Let me take an example and guide you through the steps of creating stories.
We will be exploring a dataset that has news headlines and stock price details for the NASDAQ 100 tech companies. The columns selected are as follows.
1. Begin with a pen-paper approach
Visually engaging presentations will inspire your audience, but they definitely take more work. Some of the best presentations were first drafted on rough pages and tissue paper.
Writing down your ideas and their flow before you start structuring your story is essential to your final product.
The single most important thing you can do to dramatically improve your analytics is to have a story to tell. A flow worked out in advance removes a lot of friction from your end result.
Aristotle’s classic five-point plan for delivering a strong impact is:
- Deliver a story or statement that arouses the audience’s interest.
- Pose a problem or question that has to be solved or answered.
- Offer a solution to the problem you raised.
- Describe specific benefits for adopting the course of action set forth in your solution.
- State a call to action.
I structured my report around plots that would give me a better understanding of my data.
The first question I had was: how can I make better business decisions about stocks using the data that I have?
A line graph would help me analyse the trends of specific stock prices.
The line graph showed a drop for all stocks in February 2016. Knowing this, I could scrape news articles from that period alone to identify what caused the drop. Now, how do I select which news source to scrape from?
If one news source reports on a particular stock more than the others do, we have reason to believe that it is a good source for that stock.
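Picking the source comes down to a simple count. Here is a minimal sketch, assuming each headline row carries a ticker and a source; the field names `ticker` and `source` and the rows themselves are made up for illustration and may not match the real dataset.

```python
from collections import Counter

# Made-up rows for illustration only; the real dataset's column names may differ.
headlines = [
    {"ticker": "TSLA", "source": "Source A"},
    {"ticker": "TSLA", "source": "Source B"},
    {"ticker": "TSLA", "source": "Source A"},
    {"ticker": "AAPL", "source": "Source B"},
    {"ticker": "AAPL", "source": "Source B"},
    {"ticker": "AAPL", "source": "Source C"},
]


def top_source_per_ticker(rows):
    """Map each ticker to the (source, headline count) pair that covers it most."""
    counts = {}
    for row in rows:
        counts.setdefault(row["ticker"], Counter())[row["source"]] += 1
    return {ticker: c.most_common(1)[0] for ticker, c in counts.items()}


if __name__ == "__main__":
    print(top_source_per_ticker(headlines))
```

A bar chart of these counts per stock would tell the same story visually.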
2. Dig deeper to identify the sole purpose of your story
- Identify precisely what the idea of your story is. Ask yourself, “What am I really giving with this story?” It’s never the story alone, but what the story can do to make decision making better. What you’re really presenting is the prospect of better decisions or better analytics.
- Develop a personal “passion statement.” In one sentence, tell your prospects why you are genuinely excited about working with them. Your passion statement will be remembered for a long time.
3. Use powerful headings
- Create your heading: a one-sentence statement for your story, visual, or analysis. The most effective headings are concise, specific, and offer a personal benefit.
- Remember, your heading is a statement that offers your audience a vision of a better understanding. It’s not about you. It’s about them.
4. Design a Road-Map
5. Conclude with brevity
Now that you have put forward all the points of your story, your conclusion should be short and powerful. In my report, I wrote short summaries of three to four lines explaining why to buy a particular stock.
Types of Data and Suitable Charts
Let us look at the common types of data we encounter and how to tell stories with them by selecting the best-fitting charts.
Commonly encountered types of data:
1. Textual Data
When data comes in this form, it is usually useful to find how often a word has been used or what the sentiment of the text is. Stories can be told best using this form of data.
One of the best-suited visualizations for textual data is the word cloud. A word cloud enlarges the more frequent words and brings them towards the center, giving us a clear picture of the general idea of the text.
For example, the word cloud displayed above in this article represents a Twitter dataset. It shows that “love” is the most frequent positive term used in the tweets.
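Behind every word cloud sits a plain frequency table. Here is a minimal sketch of that counting step using only the standard library; the stop-word list is a tiny illustrative one and the sample sentences are made up, not taken from the Twitter dataset.

```python
import re
from collections import Counter

# A small, illustrative stop-word list; real projects use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "of", "i", "my", "this", "so"}


def word_frequencies(texts, top_n=5):
    """Count lowercase words across texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)


if __name__ == "__main__":
    # Made-up sentences standing in for tweets.
    sample = ["I love this song", "Love the weather today", "So much love"]
    print(word_frequencies(sample, top_n=3))
```

A word cloud library can then size each word by its count, so the most frequent term ends up as the largest one on the canvas.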
2. Mixed Data
When our data mixes numeric values with other formats, we need to know which variables are important and give us better insights into our dataset.
The preferred visual for this kind of data can vary; here I will show you how to use facet grids. I will be using the Titanic passenger data.
As this plot shows, women and first-class passengers had a higher chance of survival than men who were part of the crew or the lower boarding classes.
Isn’t that what really happened on the Titanic?
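A facet grid draws one small panel per combination of categories, so the number behind each panel is just a grouped rate. Here is a minimal sketch of that grouping; the records are made up for illustration and are not the real Titanic figures, and the field names `sex`, `pclass` and `survived` are my own assumptions.

```python
from collections import defaultdict

# Made-up records for illustration; these are NOT the real Titanic figures.
passengers = [
    {"sex": "female", "pclass": "1st", "survived": True},
    {"sex": "female", "pclass": "3rd", "survived": True},
    {"sex": "female", "pclass": "3rd", "survived": False},
    {"sex": "male", "pclass": "1st", "survived": True},
    {"sex": "male", "pclass": "1st", "survived": False},
    {"sex": "male", "pclass": "Crew", "survived": False},
]


def survival_rate_by_facet(rows, facets=("sex", "pclass")):
    """Return the survival rate for each combination of facet values."""
    totals = defaultdict(lambda: [0, 0])  # key -> [survivors, passengers]
    for row in rows:
        key = tuple(row[f] for f in facets)
        totals[key][0] += row["survived"]  # True counts as 1, False as 0
        totals[key][1] += 1
    return {key: survived / count for key, (survived, count) in totals.items()}


if __name__ == "__main__":
    for key, rate in survival_rate_by_facet(passengers).items():
        print(key, f"{rate:.0%}")
```

Each key in the result corresponds to one panel of the grid, which is why the pattern across sex and class is so easy to read off the plot.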
Another way to visualize this kind of data is by trying a multivariate plot. The dataset in use for this plot is the Car Performance and Specifications dataset.
Here we can see that cars with a heavier build are slower than the ones with lighter bodies. Makes sense, right?
3. Numeric Data
When we encounter this kind of data, we’re usually looking for trends in the numbers. The visual that suits numeric data best is a line or a step graph.
Here, we can very clearly see the rise of prices at a local attraction for adults and children. See how easy it is to spot the growth at each yearly interval?
4. Stock Data
Another kind of dataset we often encounter relates to stocks. Stock market data is primarily a time series of numeric values, but as a trader or an investor, I would want to examine each date and each drop carefully.
The most visually captivating chart for this purpose is the candlestick chart.
Here, we take the example of Tesla’s stock. A candlestick chart lets us move across the dates and see the low and high of the stock for each one individually. This could help us make better investment decisions based on current or past market trends.
As the graph shows, Tesla’s stock dropped in February 2016. We could now combine this information with other market conditions and economic situations to make decisions about the stock.
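Each candle on such a chart summarizes one period with four numbers: open, high, low and close. Here is a minimal sketch of how time-ordered prices collapse into daily candles; the prices and day labels are made up for illustration and are not Tesla’s actual prices.

```python
from itertools import groupby

# Made-up (day, price) rows, already sorted by time within each day.
trades = [
    ("day 1", 10.0), ("day 1", 12.0), ("day 1", 9.0), ("day 1", 11.0),
    ("day 2", 11.0), ("day 2", 8.0), ("day 2", 9.5),
]


def daily_candles(rows):
    """Collapse time-ordered (day, price) rows into one OHLC candle per day."""
    candles = {}
    for day, group in groupby(rows, key=lambda row: row[0]):
        prices = [price for _, price in group]
        candles[day] = {
            "open": prices[0],    # first price of the day
            "high": max(prices),
            "low": min(prices),
            "close": prices[-1],  # last price of the day
        }
    return candles


if __name__ == "__main__":
    for day, candle in daily_candles(trades).items():
        direction = "up" if candle["close"] >= candle["open"] else "down"
        print(day, candle, direction)
```

A candlestick chart draws the body between open and close and the wicks out to high and low, so a day like the second one here shows up at a glance as a falling candle.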
5. Geographic Data
When we have data pertaining to specific locations and areas, we use maps to add clarity and meaning to our analysis.
In this example, we can see how countries fared at and after the 2002 World Cup. Germany scored the most goals and has been one of the most dominant teams in world football ever since.
Storytelling during the steps of predictive modeling
We are often asked how our stories and visuals can help when it’s time to create mathematical models. At every stage of predictive modeling, storytelling can be a vital addition to your analysis.
Let us understand the basic steps involved in creating models out of our data and go through telling stories within them.
1. Data Exploration
The first step of model building is understanding your data. I’ll walk through a few examples and show you how you can explore your data without computing complex statistics.
Let’s consider a dataset on wine quality. The structure of the dataset is as follows:
Here, we can see the associated summary statistics of the dataset in use.
So, if we need to see whether there is any correlation between alcohol content and wine quality, how do we do it?
We could compute Pearson’s ‘r’. That would help us in building a model, but it would not help much with the analysis.
This shows a very strong correlation between alcohol content and wine quality. But does it tell you anything else?
Not really. So, what does?
Let’s see how we can visualize these and tell a lot more from them.
First, we’ll begin by seeing how Wine Quality relates to Alcohol content.
Here, we can see that higher alcohol content goes with better wine quality, which gives us a better understanding of our data. Outliers are also easier to spot in this view.
Next, have you wondered how the acid content of a wine affects its quality?
This would be one way to visualise the effects of acid. Where the violin plot widens horizontally, more data points fall within that range.
2. Feature Visualizing
After you generate features, how do you check how well each one predicts?
Graphs tell us how far our observed points are from our fitted line.
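Those distances are the residuals of the fit. Here is a minimal sketch, assuming a single feature and an ordinary least-squares line; the feature and target values are made up for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def residuals(xs, ys):
    """Vertical distance of each observed point from the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


if __name__ == "__main__":
    # Made-up feature/target values for illustration.
    xs = [1, 2, 3, 4]
    ys = [2, 4, 5, 9]
    print(fit_line(xs, ys))
    print(residuals(xs, ys))
```

Plotting the residuals against the feature shows at a glance where the feature predicts well and where it misses, which a single error score hides.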
Another case where we might have to visualize newly created features is Principal Component Analysis. If you want to get an in-depth understanding of PCA, you can go through this article.
This is the Iris dataset found in RStudio.
When we plot this, however, we find that the resulting visual is much more informative than the statistics.
3. Model Creation and Comparison
Coming to the model creation phase, we usually find the need to understand how our data is being fitted.
This is a model that predicts whether the car should go fast or slow, based on the grade of the road and bumpiness.
As you can see, the decision boundary classifies most of the data correctly, but an accuracy figure of 88.21% alone doesn’t tell much of a story. In the plot, we can even see how far the misclassified points are from the decision boundary.
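To make that idea concrete, here is a minimal sketch that reports accuracy together with the distance of every misclassified point from a straight-line boundary. The weights, bias and sample points are hypothetical values chosen for illustration; they are not the model or the data behind the plot.

```python
import math

# Hypothetical linear boundary: w_grade * grade + w_bump * bumpiness + bias = 0.
# Points on the positive side are labelled "slow", the rest "fast".
WEIGHTS = (1.0, 1.0)
BIAS = -1.0


def signed_distance(point):
    """Signed perpendicular distance from a (grade, bumpiness) point to the boundary."""
    score = sum(w * v for w, v in zip(WEIGHTS, point)) + BIAS
    return score / math.hypot(*WEIGHTS)


def predict(point):
    """Classify a point by the side of the boundary it falls on."""
    return "slow" if signed_distance(point) > 0 else "fast"


def evaluate(points, labels):
    """Return accuracy plus each misclassified point with its distance to the boundary."""
    misses = [
        (p, abs(signed_distance(p)))
        for p, label in zip(points, labels)
        if predict(p) != label
    ]
    accuracy = 1 - len(misses) / len(points)
    return accuracy, misses


if __name__ == "__main__":
    # Made-up (grade, bumpiness) points and labels.
    points = [(0.2, 0.2), (0.9, 0.9), (0.6, 0.5), (0.1, 0.3)]
    labels = ["fast", "slow", "fast", "fast"]
    print(evaluate(points, labels))
```

Two models with the same accuracy can differ a lot here: near misses that sit close to the boundary tell a very different story from errors that land far inside the wrong region.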
We can also compare certain algorithms and techniques by looking at their decision boundaries as we did above.
Another example using the Iris dataset is shown below.
Here, there isn’t enough information to derive valuable insights about our model.
To learn more about Support Vector Machines, you can go through this article.
On the other hand, this plot shows us a clear classification boundary where the species separate from each other.
Best Practices for Storytelling
Now that you know the scenarios where we can use storytelling to explain our point, I will give you a few practical tips for when you take this up on your own.
- Always label your axes and give your plot a heading.
- Use legends where necessary.
- Use colours that are lighter on the eye and in proportion.
- Avoid adding unnecessary detail to your visualisation like backgrounds or themes that don’t allow good readability.
- Points are the only marks that can encode two quantitative values at once, one through horizontal position and one through vertical position.
- Never use points when you are encoding a time series.
End Notes
Storytelling can do more than it is usually used for. It can uncover insights from your data that you might have missed before. Relations between features and data that numbers can never clearly depict can be shown using stories and charts.
In this piece, we’ve elaborated on how stories are used in almost all avenues to explain a detail better. Starting from which charts suit specific data types well, we’ve gradually gone on to how stories are used in the steps of model building.
I hope you had a great time reading the article. I am eager to hear your data stories!