The Titanic passengers database has been used in all the exercises and lectures I have given over the last five years, either as a teaching example or as material to test trainees' skills. It is suitable for training at many levels: advanced search-engine research, organizing and cleaning data, analyzing it with conditional formatting and pivot tables in Microsoft Excel, or data visualization.
I thought this was a personal habit until I met the British journalist Jonathan Stoneman and worked alongside him as a co-trainer for a few days in Tunisia. During our conversation the night before the first day of training, while we were planning the practical exercises for the coming days, Jonathan suggested we open with a card game: each trainee receives a card showing one column from a dataset and discusses with a neighbor, who holds a different card from the same dataset, what story they could reach if they had only those two columns. The goal is to get trainees thinking the way a data journalist does, by building relationships between the available data.
When I asked him which database we would use for this exercise, he answered: “Definitely, Titanic.” The funny thing is that I had never seen the film; I had no curiosity about it. Yet through the data I imagined many scenes: the variety of the ship's passengers, where they stayed on board, and the evacuation process. There are many untold stories that could be extracted from the passengers' database, which pushed me to write a series of blog posts explaining the methodology of working like a data journalist and sharing some skills and techniques that can be applied to the Titanic passengers' database.
Many versions of the database are available on the internet, with different formats and fields. Some include data for both passengers and crew and record the fate of each after the disaster; others cover only the passengers, with fields for each passenger's sex, age, address, ticket fare, and travel class. The version I prefer to work with is titled “Titanic Dataset, V3.5”.
It is easy to find with a quick Google search. In this post we will start from the data table itself: we won't cover how to search for data on the internet or how to scrape it from web pages, and I won't go into verifying the data and its source. These are, of course, things a data journalist, or any journalist, should consider when working with data they did not collect themselves. But I am dedicating this series to the steps of the work itself: asking questions and meeting the data, analyzing it to find the journalistic stories within, and finally designing the visuals that help tell the story.
Exploring the data
The database looks organized and simple. It consists of a single worksheet containing more than 18 thousand cells: 1309 rows, each representing one passenger, and 14 columns, each representing a different variable such as the passenger's age or the port where they boarded. Scanning the data, we find it divides into quantitative data, such as the number of relatives (children and parents) on board or the body identification numbers, and descriptive data, such as the passenger's sex and whether they survived or drowned. To explore the data accurately, it is preferable to convert the range into a table in Microsoft Excel by opening the Insert tab and choosing Table.
The table adds a filter button at the head of each column, which simplifies sorting and rearranging and lets us see the range of values in each column. For example, clicking the small arrow on the travel class column reveals that the data includes three classes: first, second, and third. It also shows that the rows are ordered by class: the first-class passengers appear one after another, then the second class, then the third.
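The same exploration can be done outside Excel. Here is a minimal sketch using pandas rather than the Excel steps described above, with a tiny made-up sample standing in for the real 1309-row file; listing the distinct classes and counting passengers in each mirrors what the filter arrow shows:

```python
import pandas as pd

# Tiny illustrative sample; the real "Titanic Dataset, V3.5" file
# has 1309 rows and 14 columns.
df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# Equivalent of clicking the filter arrow on the class column:
# the distinct classes, and how many passengers fall in each.
print(sorted(df["Pclass"].unique().tolist()))    # the three classes
print(df["Pclass"].value_counts().sort_index())  # passengers per class
```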
In most cases, data collected by other people comes with a readme file or documentation that explains the abbreviations and calculation methods. That is not available with this dataset, so it is preferable to create our own data dictionary, written in a new worksheet within the same Excel file.
| Column title | Description | Data type | Categories number |
| --- | --- | --- | --- |
| Pclass | The passenger's travel class (1 = first class / 2 = second class / 3 = third class) | Descriptive | 3 |
| Survived | The passenger's fate (1 = survivor, 0 = drowned) | Descriptive | 2 |
| Age | The passenger's age (for children younger than 1 year old: the age in months divided by 12) | Numerical | Less than 1 to 80 |
| SibSp | The number of siblings and/or spouses on board | Numerical | From 0 to 9 |
| Ticket | The ticket number | Numerical | |
| Fare | The ticket price | Numerical | |
| Cabin | The cabin number | Descriptive | A – B – C |
| Embarked | The port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) | Descriptive | 3 |
| Boat | The number of the lifeboat | Numerical | |
| Body | The body identification number, in case of drowning and the body being found | Numerical | |
| Home.Dest | The passenger's address | Descriptive | 48 |
The missing values
Although the data looks organized and tidy, some columns contain many blank cells, so we can start by looking for the missing values in our dataset. A quick visual check is usually enough when working with a small dataset like ours: just by scanning the records we can spot fields with missing values, such as Age, Cabin, and Embarked, and begin to estimate the effect of those gaps.
To determine the amount of missing values in each column, the COUNTBLANK function can be placed at the end of each column to count the blank cells in it. For the first data column, for instance, the formula would look like this (assuming the headers sit in row 1 and the 1309 passenger rows run from row 2 to row 1310):

=COUNTBLANK(A2:A1310)
We then drag the formula horizontally to repeat it at the end of each data column. Nine columns total zero, meaning they have no missing data, while the counts vary across five columns: the age, the cabin number, the body number, the boat number, and finally the passenger's address.
The Body column tops the list, missing more than 90% of its values. It is followed by the Cabin column, missing about 80%, then the Boat column, missing more than 60% of its entries, then Home.Dest with more than 40%, and finally the Age column, missing 20% of its entries.
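For readers who prefer code over spreadsheet formulas, the same blank-cell count can be sketched in pandas; a hypothetical five-row sample stands in for the real file here:

```python
import pandas as pd

# Hypothetical mini-sample standing in for the real dataset.
df = pd.DataFrame({
    "Age":   [29.0, None, 24.0, None, 51.0],
    "Cabin": [None, None, "C85", None, None],
    "Fare":  [7.25, 71.28, 8.05, 13.00, 26.55],
})

missing = df.isna().sum()              # blanks per column (COUNTBLANK equivalent)
missing_pct = missing / len(df) * 100  # share of missing values per column
print(missing)
print(missing_pct.round(1))
```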
There are many ways to handle missing values. But before choosing one, we should decide whether we need to fix them at all, or whether it is better to exclude the affected columns. A rule of thumb we can work with, the 30% rule, says to drop any data column missing 30% or more of its entries, because analyzing what remains of it will give distorted results that do not reflect the dataset as a whole.
Imagine surveying 100 people about their monthly salary, with 30% of the sample declining to answer: the responses you collect represent only 70% of the sample and do not reflect the salaries of the whole group.
If we apply this rule, only one of the columns above, Age, remains for us to work on; for it we can try to find a solution for fixing the missing data.
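The 30% rule is easy to automate. A minimal sketch in pandas, again on made-up data and with the threshold as a parameter you can adjust:

```python
import pandas as pd

# Made-up sample: Age is 20% missing, Body is 90% missing, Fare is complete.
df = pd.DataFrame({
    "Age":  [29.0, None, 24.0, None, 51.0, 40.0, 18.0, 33.0, 27.0, 30.0],
    "Body": [None, None, None, None, None, None, None, None, None, 135.0],
    "Fare": [7.25] * 10,
})

THRESHOLD = 0.30                                  # the 30% rule
keep = df.columns[df.isna().mean() < THRESHOLD]   # fraction of blanks per column
df_kept = df[keep]                                # Age and Fare survive; Body is dropped
print(list(keep))
```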
In some cases we can go back to the source and contact them to learn why some data is missing, or search for another source for the same dataset and compare it with what we have; perhaps the other source used a different collection methodology and its database does not suffer the same gaps. If neither option is available, we can fall back on a statistical solution: filling the missing values with an average of the available ones.
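That statistical fallback, filling gaps with the mean of the known values, can be sketched as follows (illustrative values only):

```python
import pandas as pd

ages = pd.Series([20.0, None, 40.0], name="Age")  # one missing age
ages_filled = ages.fillna(ages.mean())            # mean of the known ages is 30.0
print(ages_filled.tolist())                       # [20.0, 30.0, 40.0]
```

For a skewed variable such as age, the median is often preferred over the mean, but the text's method is the average, so that is what the sketch uses.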
Identifying the story hero
Before we start organizing and analyzing the data, we need to determine how important each field in the database is. We usually classify the data columns into three categories:
The hero is the column with the greatest positive effect: it relates strongly to more than one other column in the database, and it may even be possible to derive new data columns from it. The second type is the supporting variable, which does not always change the course of the analysis but can enrich the results with more detail and depth; we can call it the story's secondary hero. Finally comes the adversary, the variable that creates noise and distraction by its presence in the data, perhaps because it contains a large amount of missing data or adds nothing to the core of the story.
The card game Jonathan suggested can help assign these roles to the data columns, by trying to connect columns in pairs and seeing whether the combination leads to rich story details. The travel class and the passenger's survival status clearly play the protagonist roles in any story we can build: we can explore the relationship between class and age, or between class and fare, and perhaps find fare differences among passengers of the same class.
The number of siblings or spouses and the number of parents or children play the supporting roles, since together they tell us how large each passenger's family on board was, and in particular whether the largest families were the luckiest in surviving or the unluckiest in drowning.
The lifeboat numbers and the body numbers, by contrast, matter less: they have a lot of missing data and add little to the core of the story's details.