The headline might have surprised many of you, as some must have expected that the data analysis phase comes after exploring the data. In fact, the analysis comes later, preceded by the data processing phase that includes data organizing and cleaning, and then comes the “interviewing” phase that includes asking the questions and interviewing the data to get to the information that might lead us to a news story.
90% of the time consumed in producing a data-driven news story goes to processing and preparing the data for the “interview”. This is a no-skip phase, as it might result in a misleading and inaccurate news story because random data will definitely lead to wrong analysis, and consequently wrong conclusions.
Before we start, make sure that you understand the data and have all the related information. The information can be knowing the data original source; determining the identity of the organization/association that produced the data, determining whether it has collected the data itself or through another source; determining whether the data was aggregated or disaggregated; knowing how old the database is and whether or not it has been updated after it was first published. Does the data record have a tab explaining the methodology of collecting the data; and finally the possibility of collecting the data from another resource, through other procedures, like advanced search on search engines or web scraping.
Organizing the Data
A first glance at the data table of the passengers of the Titanic, you might think that everything is fine and the data is organized. And of course it is organized, because it is in a table of columns and rows. However, organizing the data here doesn’t only mean to organize its structure but also to organize the methodology, with which you work on the data.
The first thing that should be done is to add a new worksheet in MS Excel. Name it “Resources”. In that worksheet, insert the link from which you got the database, date of publication, source name, how it was collected and a clear definition of the included abbreviations and terminologies. These inputs are called Metadataـــ the descriptive data of the database. Some think tanks and governments include this section in their data tables. However, if you couldn’t find it, you should create one before anything else.
| Include the following information:
Then, make a copy of the data in a new worksheet so that you would have a pack up. It’s a very cautious procedure because you might do some changes (adding or deleting) to the data table, and at some point you might need to go back to the original data which might be hard if you don’t have a pack up.
For further cautiousness, you can copy the Excel file and save it in a another place on your personal computer. You may feel that there’s no need for that, given that the data is available online and that you have a link to it. But remember: easy come, easy go. The data table might disappear from the internet for several reasons. The most simple reason could be that the website gets blocked or broken. The source might do amendments to the structure of its hyperlinks, and therefore the link will take you to a broken page.
Now, look at the data table. We have a worksheet named Source, another one named Raw Data and a final one named Copy.
In case the input amount in the data table is bigger than the size of your screen, you can finish up organizing the data with some additional steps like freezing the first row and the first column that have the tabs, so that you can move in the data table flexibly to the right, left, up and down without losing the ability to see the data tabs wherever you move.
You can also hide the unnecessary columns and rows. Do you remember, in the last blog, when we agreed on deleting the inputs with a high proportion of lost data like the columns of the drowned corpses’ number, the lifeboats’ number and the one with the final destination of the passengers.
To freeze the first row and the first column, choose Freeze Panes from View. Then, choose Freeze Top Row to freeze the first row and Freeze First Column to freeze the first column.
To hide the columns that have unnecessary data, select the columns by clicking on the column’s heading, move between the columns pressing Ctrl so that you can choose the remaining unnecessary columns. After the selection, click right and choose Hide.
Of Course, there are more and more steps you can take for organizing the data. Ms. Excel has several features related to organizing the data like aligning the inputs, reorganizing the columns so that it would be consistent with the context within which you want to see the data. They are all optional steps, they differ from a database to another, while the previous steps are considered to be a fixed methodology that should be followed with any future database.
We can define the data cleaning process as a process of exploring and correcting (or erasing) the invalid or inaccurate inputs of a set of records, tables or a database. And also defining the parts that are incomplete or unrelated to the data then replacing, modifying, or deleting them from the database.
It’s likely that when we get a set of data from any source we will find some mistakes, that occured during the data entry process. It is rare to be lucky enough to leave this stage behind except in one case; that it was you who collected the data and tabulated it yourself.
Before you start cleaning the data, add a new worksheet to the data classifier and call it “Data Diary”. Document through it all the steps you made for cleaning the data. This step is considered to be one of the best practises to deal with the data. It only requires little effort. It’s not necessary to make a big space documenting each step. It is enough to write some short sentences to give you or anyone else enough information to understand all the procedures done to process the data.
The process of cleaning the data is about the ability of the journalist to determine the defect in the database as each problem has more than one way to be solved, and each problem is different from the others. It may require using software special for cleaning the data such as; Open Refine or programming languages like Python or R in case the data size that you deal with is too big and exceeds the power of Excel to deal with it, or if processing it with the conventional ways takes a lot of time.
In the case of the ship Titanic that we have now, all the mistakes in the data table can be processed and cleaned using so simple solutions that are available in Microsoft Excel. The following table the Types of mistakes and how to solve them:
|Going through each column and reformatting it from the Numbers section in the main menu, and choosing a description for the inputs that is appropriate for the nature of the inputs in it.||Pointing at any column in the data table, we will find that the cell that appears in the Number section in the main menu of Excel appears to be General, while the more accurate is that Text appears in the columns that contain texts and Numbers in the columns that contain numbers.||The format of the columns is inconsistent with the nature of the data included.|
|Here we can use the feature of converting the text to columns that allows us to split the inputs into new independent columns by defining the inputs separation method according to the mark that separates them, either it was a comma, a space or anything else.||Notice that the column of the passengers includes three different inputs which are: The first name, the nickname and the surname. The rule in Excel states that each single cell must contain a single input.||Splitting parts of the inputs into new independent columns.|
|The issue can be solved in many ways, the simplest is to filter each column according to the listed values and replacing it with the proper word to describe it.
For example, filtering the first column and choosing only number 1, hence replacing the 1 with the “First Class” and successively like that with all the columns that contain illogical encoding.
|The data contains a number of special abbreviations in the first and second columns, where the classes appears in a numerical form. Number 1 refers to the first class and 2 refers to the second class and so on. The drowned takes the code 0, while the survivors the code 1 in the second column of the passenger status.
They are abbreviations expressed but inefficient as the nature of the inputs in those two columns are descriptive data not numerical. There is no point from adding all the zeros of the drowned passengers because the aggregate will definitely be zero.
|Modifying the abbreviations|
|There are many different ways to solve this issue:
|The age column contains many missing values that makes it difficult to do the future analysis process which depend on the work variable as a main variable in the analysis process.||The missing data|