Details of Prospecting in Hundreds of Cells and Documents
With the beginning of a new year, we would like to share with our readers and fellow journalists the details of a number of data-driven stories that we published last year, 2018. Taken together, these stories won the award for best data journalism team, presented by the Global Editors Network (GEN) for 2018.
We learned a great deal on a personal level over the past year. We gained experience in data journalism, not only in tools and general rules but also in the techniques and tricks of collecting and handling data, uncovering the stories it contains, and processing, arranging and narrating those stories visually.
Although we did not publish many stories last year, we focused on producing purposeful, inspiring, data-supported stories that did not rely on the daily news cycle. We also devised a number of our own techniques to overcome obstacles related to the format of the available data. These obstacles kept us from obtaining organized, consistent data in Excel tables fit for analysis, filtering and tabulation, which we need in order to turn the results into stories told through visual elements.
Below, we go behind the scenes of some of the stories we produced: the problems we met, the solutions we found, and the tools we used to present and narrate these stories visually.
Murder in the Arms of Marriage: The Story of 222 Cases
Recycling the news to create a different story
Domestic newspapers and websites usually cover violent crimes as routine. Such crimes happen almost every day; we hear about a husband who kills his wife, or the reverse. We therefore looked into the reasons for and methods of these crimes, even though no official information was available. The daily news was our only source, so we decided to rely on the Youm7 (Seventh Day) website and search its archive to collect as many news items on family violence crimes as possible.
From news archives to organized database
We collected all the news items published by the site from Jan. 1, 2014 to Nov. 21, 2017 using a tool provided by the “import.io” site, extracting the items for the study period by entering keywords such as “husband,” “wife” and “murder.” The tool returned 600 news items about murders and attempted murders published during that period.
After a careful review of the 600 items, we found that some reported crimes committed in countries other than Egypt, and some described fictitious crimes: the “import.io” tool had picked them up from coverage of the plots of films and television series. We also found multiple items about a single crime, most likely follow-ups on its consequences and legal proceedings.
After auditing the 600 items, removing duplicates and excluding crimes committed outside Egypt and entertainment news, we had discarded almost two thirds of them and kept 222 items, one item per case.
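The filtering rules described above could be expressed in code roughly as follows. This is only a sketch under stated assumptions: the source describes an audit of the items rather than a script, and the field names (`case_id`, `country`, `section`) and the sample rows are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NewsItem:
    url: str
    case_id: str   # hypothetical label given by a reviewer; follow-ups share it
    country: str   # where the reported crime took place
    section: str   # e.g. "crime" or "arts" (film and series coverage)


def keep_one_item_per_case(items):
    """Drop foreign crimes and arts news, then keep the first item per case."""
    kept = {}
    for item in items:
        if item.country != "Egypt":
            continue  # crime committed outside Egypt
        if item.section == "arts":
            continue  # fictitious crime from a film or series plot
        kept.setdefault(item.case_id, item)  # ignore later follow-ups
    return list(kept.values())


# Hypothetical sample rows, for illustration only.
sample = [
    NewsItem("u1", "case-1", "Egypt", "crime"),
    NewsItem("u2", "case-1", "Egypt", "crime"),   # follow-up of case-1
    NewsItem("u3", "case-2", "Egypt", "arts"),    # series plot
    NewsItem("u4", "case-3", "Jordan", "crime"),  # outside Egypt
    NewsItem("u5", "case-4", "Egypt", "crime"),
]
print([item.url for item in keep_one_item_per_case(sample)])
```

Keeping the first item per case mirrors the "one item per case" rule; which item to keep for each case is a choice the sketch makes arbitrarily.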
Manual construction of database
In this phase, we converted the extracted data into a more organized form and tabulated it in Excel tables that could be analyzed and filtered. The extraction tool had given us only the link to each news item and the full introductory paragraph, which held the basic information as running text. The next step, therefore, was to divide the team into two groups. The first group broke these texts down into individual data points to extract the significant information, such as the occupations and ages of both victim and perpetrator, the cause of the conflict and the method of killing, and entered them in columns to build a database. The second group reviewed the first group's work, removed duplicate entries and confirmed the accuracy of the initial data entry.
Answering the basic questions
Once we had built a precise, orderly and clearly tabulated dataset, we began another phase of analysis and cross-referencing to extract results, which astonished us as much as our readers. The results answered the basic questions: Who commits more murders, the husband or the wife? At what age? What drives spouses to commit crimes against each other? What methods of killing or attempted killing does each use?
You may not be surprised to learn that the husband killed his wife in 147 cases and the wife killed her husband in the other 75. If we divide the number of wives by the months covered by the data, we find that, on average, about 3 wives per month were killed, or targeted in a murder attempt, by their husbands, which means one victim every 10 days.
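The averages quoted above can be checked with a few lines of arithmetic. The sketch below uses only the figures given in the text (147 and 75 cases, Jan. 1, 2014 to Nov. 21, 2017); counting November 2017 as a full calendar month is an assumption of the sketch.

```python
from datetime import date

start, end = date(2014, 1, 1), date(2017, 11, 21)
wife_victims = 147     # husband killed, or tried to kill, his wife
husband_victims = 75   # wife killed, or tried to kill, her husband

# Calendar months touched by the study period (assumption: partial months count).
months_covered = (end.year - start.year) * 12 + (end.month - start.month) + 1
days_covered = (end - start).days + 1  # inclusive of both end dates

wives_per_month = wife_victims / months_covered
days_per_victim = days_covered / wife_victims

print(months_covered)              # 47 calendar months
print(round(wives_per_month, 1))   # about 3 per month
print(round(days_per_victim, 1))   # roughly one victim every 10 days
```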
Egyptian President’s Decisions During the First Presidential Period
Plentiful data, available but not digitized
We faced dozens of kilograms of paper: several huge bundles of monthly issues of the official gazette, plus a number of supplements, filled with an enormous number of government decisions. These decisions were approved by the Egyptian President and the Egyptian Prime Minister in accordance with Egyptian legislation, which makes the president the final stage in issuing laws referred by the Egyptian House of Representatives; the laws are then published in the official gazette to take effect.
In this story, we set out to analyze the first presidential term of the Egyptian President, Sisi, through the laws that he approved from June 2014 to Dec. 2017 and that were published in the official gazette to take effect.
Converting print to digital
It was a long, hard mission: we dove into a muddy pond of densely printed text that took great effort to convert into organized, tabulated data that can be analyzed and filtered. After three months of editing and auditing, we found that the sixth president of the Arab Republic of Egypt had issued 1,178 presidential decisions during the above-mentioned period, which equals the total number of decisions approved during his first presidential period. These decisions are now available on our computers, where we can dig through them and find hidden details.
Strategy for arranging the data and revising its display
We used two methods to analyze this huge amount of data. The first was to arrange the decisions from oldest to newest, to see how their number developed over time and to identify the periods when laws were issued most intensively. The second was to classify the data into principal classes, such as economic, administrative and political decisions. The principal classes were then divided into more precise subclasses; for example, a decision issued on a certain date may be an economic decision related to the State General Budget, or an administrative decision appointing an official to a certain position.
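The two methods above, ordering by date and grouping by class and subclass, can be sketched as follows. The rows, field names and category labels are hypothetical and stand in for the real table of 1,178 decisions, which is not reproduced here.

```python
from collections import Counter
from datetime import date

# Hypothetical rows, for illustration only.
decisions = [
    {"issued": date(2015, 3, 2), "category": "administrative", "subcategory": "appointment"},
    {"issued": date(2014, 6, 10), "category": "economic", "subcategory": "state general budget"},
    {"issued": date(2015, 3, 20), "category": "economic", "subcategory": "state general budget"},
    {"issued": date(2016, 1, 5), "category": "political", "subcategory": "other"},
]


def sort_oldest_first(rows):
    """Method 1: arrange decisions from oldest to newest."""
    return sorted(rows, key=lambda row: row["issued"])


def count_by_month(rows):
    """Count decisions per (year, month) to spot the busiest periods."""
    return Counter((row["issued"].year, row["issued"].month) for row in rows)


def count_by_class(rows):
    """Method 2: count decisions per (principal class, subclass)."""
    return Counter((row["category"], row["subcategory"]) for row in rows)


print(sort_oldest_first(decisions)[0]["issued"])
print(count_by_month(decisions).most_common(1))
print(count_by_class(decisions).most_common(1))
```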
Display of data per request
The last challenge was how to display this vast number of decisions to readers so that they could read any decision without getting bored by the amount of data in it. We therefore converted the unarranged data into classified data presented in an interactive visual form that gives the reader or user a clear overall picture and a detailed one at the same time. To build the visual shapes, we relied on very small units, each representing one decision with all its details.
Social Media as Material for Investigation
Social media has recently been used in a different way: instead of serving as an escape from reality through chatting and exchanging comics and funny pictures, it has become an arena for political conflict. This shift drew our attention especially last June and July, when a hashtag campaign broke out and reached the top of the trending list on Twitter.
These hashtags reflected a conflict between the opponents of the Egyptian President, Abd El Fattah El Sisi, and his supporters; both sides published a number of hashtags with opposite meanings and different wordings. The campaign continued for several days, with each side trying to draw the largest number of Twitter users to its hashtag.
Because the political conflict had moved to Twitter, we began to follow the issue by focusing on this recent wave, monitoring and analyzing the two big blocs that formed under two different hashtags: “GO SISI” and “SISI WILL NOT GO.”
Starting point…software code
During the two months of inquiry, analysis and preparation of the story, our team collected, during June, a sample of 11,268 tweets carrying the opposing hashtag (including retweets, replies and directed tweets) and another sample of 2,570 tweets carrying the supporting hashtag, using custom code written in the R programming language. The code gave us tabulated data containing the date and time each tweet was published, its text, links to media files (images and videos), the number of interactions with each tweet and the name of the tweeting account.
Our team relied on integrating this code into our workflow, which made it easier to extract the basic data for our story, because unfortunately no free tool was available for the job. For this reason, we improved the code so it could be used on a larger scale and made it available to the rest of the team, including non-programmers, for other ideas that require investigation through Twitter.
Seeking unfamiliar things
The collected data formed two big blocs, each containing smaller blocs of different sizes. We needed to split these blocs to identify their accounts and determine whether they were personal accounts or programmed “bots.” We therefore used Botometer, a free tool developed by the Network Science Institute at Indiana University.
The tool helped us determine whether an account is a bot by evaluating it against a variety of criteria to produce an overall score between zero and 5. If the score approaches zero, the account is likely personal; if it approaches 5, the account is probably a bot.
The tool significantly narrowed the circle of suspicion, but many accounts still looked suspicious because their activity was far greater than that of a personal account in terms of tweeting intensity, stance and other traits. This made us pause and decide to set our own criteria for distinguishing personal accounts from bots.
Tools alone are not enough
We learned an important lesson here. When the tool fell short of our objective and left us confused, we set a number of criteria for judging an account's identity: the account's creation date, how closely the account's timeline is tied to the promoted hashtag, and the stance of the account's tweets.
We sometimes found accounts republishing tweets that contradicted their own stance (supporting or opposing) simply because those tweets used the hashtag that represents the account's general stance. This means such accounts are programmed to republish any tweet containing a specified hashtag regardless of its content, whether or not the tweet backs the account's political stance. We also considered the account's name (an ordinary name or a string of symbols), its profile picture, and the ratio between the number of accounts it follows and the number of its followers; it is illogical for an account to have thousands of followers while following only one or two accounts itself.
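The criteria listed above can be combined into a simple checklist, sketched below. This is not the team's actual procedure: every threshold (`score_cutoff`, `new_account_days`, `many_followers`, `few_following`), every field name and both sample accounts are hypothetical, and the Botometer score is assumed to have been looked up beforehand and stored as a number from 0 to 5.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Account:
    name: str
    created: date
    followers: int
    following: int
    botometer_score: float           # 0 (likely personal) to 5 (likely bot)
    retweets_against_own_side: bool  # republishes tweets opposing its own stance


def suspicion_flags(account, hashtag_start, score_cutoff=4.0,
                    new_account_days=30, many_followers=1000, few_following=2):
    """Return the list of criteria on which an account looks like a bot."""
    flags = []
    if account.botometer_score >= score_cutoff:
        flags.append("high_botometer_score")
    # Created shortly before, or after, the hashtag campaign began.
    if (hashtag_start - account.created).days <= new_account_days:
        flags.append("created_near_hashtag")
    # Thousands of followers while following only one or two accounts.
    if account.followers >= many_followers and account.following <= few_following:
        flags.append("lopsided_follow_ratio")
    if account.retweets_against_own_side:
        flags.append("retweets_against_own_side")
    return flags


campaign_start = date(2018, 6, 10)  # hypothetical date
suspect = Account("x_1_@@", date(2018, 6, 1), 5000, 1, 4.6, True)
person = Account("Mona", date(2012, 4, 3), 300, 280, 0.8, False)
print(suspicion_flags(suspect, campaign_start))
print(suspicion_flags(person, campaign_start))
```

Returning the list of triggered criteria, rather than a single yes or no, leaves the final judgment to a human reviewer, which matches the manual filtering and tracking the text describes.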
After a great deal of filtering, classification and tracking, we identified a number of bots that were active on both sides of the event, opposing and supporting. Such activity is dangerous because bots reduce the chance of real, credible discussion on social media.
Koshary Is a Measure of Inflation
Converting data into language the public understands
This story is the best example of how important it is for journalists to know the right way to address the public and simplify highly specialized terms, which most people dislike. We wanted to gauge the scale of inflation in the Egyptian market, so we looked for an appealing index of inflation and found that the best one was koshary (a popular Egyptian meal of rice, lentils, macaroni, hot sauce and spices), which many Egyptians of different economic classes like to eat.
Methodology of choice
Koshary has a unique composition that sets it apart from other popular Egyptian meals: it contains rice, macaroni, lentils, chickpeas, onions, tomato sauce, garlic and oil. These components let us measure price changes across a large number of varied everyday commodities, much like the inflation index, which reflects the change in prices of a basket of consumer goods.
Questionnaire as a data source
We first wanted to know who dominates the koshary market in Egypt. We therefore asked our audience on social media to take part in a questionnaire we had prepared to determine the five biggest names in this trade, how often participants eat the meal, and their age and income. In the end, we obtained tabulated data from 240 people.
Analysis of results led us to the next step
At this phase, we analyzed the questionnaire results and found that more than 80% of participants chose the same 5 stores. We went to these stores, bought a can of koshary in the same price category from each, and brought them to the InfoTimes newsroom to start a new, more inventive phase. We weighed each can, then emptied all the cans and weighed each component. The results were striking; for example, one of the well-known stores sold its customers a can weighing half as much as the can a competitor sold for the same price.
Innovation of a specific measuring tool
In the next phase, we decided to fill an “InfoTimes Koshary” can with the average weights of the five cans' contents and use it as a public index for measuring inflation, based on official data from the Central Agency for Public Mobilization and Statistics on the prices of the commodities used in koshary, such as rice, macaroni and lentils, in a continuous time series from 2000 to 2016.
Using this data together with our measurements and analysis, we offered the public a calculator that shows the total purchase value of the koshary can's contents in any month of the period covered by the data, once the user enters a few inputs. Here we used an engaging way to display data: letting readers take part and do what we call “playing with data.”
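The calculation behind such a tool can be sketched as a fixed basket priced at different dates. The sketch below is an illustration, not the published calculator: all weights and prices are invented, only three ingredients are included, and the function names are hypothetical.

```python
def average_weights(cans):
    """Average grams of each ingredient across the weighed cans."""
    ingredients = cans[0].keys()
    return {name: sum(can[name] for can in cans) / len(cans) for name in ingredients}


def basket_cost(basket_grams, prices_per_kg):
    """Cost of the basket given per-kilogram prices for one month."""
    return sum(grams / 1000 * prices_per_kg[name]
               for name, grams in basket_grams.items())


def index_vs_base(cost, base_cost):
    """Index value with the base month set to 100."""
    return 100 * cost / base_cost


# Hypothetical weights (grams) for two cans; the story weighed five.
cans = [
    {"rice": 100, "macaroni": 200, "lentils": 80},
    {"rice": 200, "macaroni": 100, "lentils": 120},
]
# Hypothetical per-kilogram prices, keyed by (year, month).
prices = {
    (2000, 1): {"rice": 2.0, "macaroni": 2.0, "lentils": 4.0},
    (2016, 12): {"rice": 8.0, "macaroni": 6.0, "lentils": 20.0},
}

basket = average_weights(cans)
base = basket_cost(basket, prices[(2000, 1)])
latest = basket_cost(basket, prices[(2016, 12)])
print(basket)
print(round(base, 2), round(latest, 2), round(index_vs_base(latest, base)))
```

Holding the basket fixed and changing only the prices is what makes the result comparable over time, in the same way a consumer price index works.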
Detention or Death Sentence?
Interactive narration of available, accumulated data
Producing a data-supported story is like finding a diamond in a coal mine. It is hard work, and we sometimes have to sacrifice many promising stories and good ideas to produce one story. It is not easy, and we are not sure of a good result every time. Sometimes it helps to find a “stroke of luck”: a push that sets the wheel turning, a quick and easy encouragement to keep working.
Our “stroke of luck” was our story about deaths in prisons, based on data provided by the “Daftar Ahwal” organization (a “daftar ahwal,” or “status book,” is the record kept in Egyptian prisons to register the status of prisoners). The organization collects, archives and documents data on societal issues. It is a Cairo-based research center that started as an independent initiative on July 12, 2015, and it prepared a study of deaths in police detention and prisons in Egypt from 2011 to 2016. Over this period, it recorded 800 deaths and classified them by cause of death, gender and the geographic area in which the prison or detention facility is located.
To display the data in an easy-to-use, visual and interactive way, we used the free tool Flourish, which can display data in several interactive forms. This let the public grasp many details about the 800 deaths more easily and quickly: the periods with the most and fewest deaths, the main cause of death, which sex (male or female) had the most deaths, and the governorates with the most and fewest deaths.
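Before a dataset like this is loaded into a charting tool, the summaries listed above (most and fewest by year, cause, sex and governorate) can be produced with simple counts. The sketch below uses a handful of invented records and hypothetical field names; it does not reproduce the 800 documented deaths.

```python
from collections import Counter

# Invented records, for illustration only.
records = [
    {"year": 2013, "cause": "medical neglect", "sex": "male", "governorate": "Cairo"},
    {"year": 2013, "cause": "medical neglect", "sex": "male", "governorate": "Cairo"},
    {"year": 2013, "cause": "violence", "sex": "female", "governorate": "Cairo"},
    {"year": 2014, "cause": "medical neglect", "sex": "male", "governorate": "Giza"},
]


def most_and_least(rows, field):
    """Return the (value, count) pairs with the highest and lowest counts."""
    ordered = Counter(row[field] for row in rows).most_common()
    return ordered[0], ordered[-1]


for field in ("year", "cause", "sex", "governorate"):
    print(field, most_and_least(records, field))
```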
Maps of Presidential Elections Abstention and Invalid Ballots
Searching for the other side of the story
When covering election results, most media organizations focus on the participation rate and pay little attention to the abstention rate and the rate of invalid votes. Putting the abstention and invalid-vote rates alongside the participation rate gives a clearer, more comprehensive picture of voting trends and the broader political scene. For example, in the latest presidential election, held in April 2018 and won by President Abd El Fattah El Sisi, approximately 24 million of 59 million constituents chose to go to the ballot boxes, which means that 58.9% refrained from voting. Accordingly, we chose to explore the dark side of the picture, the invalid votes and the places that abstained, to determine how voting trends changed.
Exploitation of available data
This time it was not difficult to get the data or tabulate it in Excel, because for this story we relied on the National Elections Authority as our source. It is an independent authority responsible, under the constitution, for conducting and organizing referendums and elections in Egypt, and its website offers a free, detailed database of electoral processes, with detailed results for all cities and polling stations in the Arab Republic of Egypt.
When covering electoral events, journalists usually attend to the official statements issued by the National Elections Authority without searching the organized, tabulated Excel data the authority provides.
Tracking of changes and determination of trends
By putting the collected data in a comparative context, we got a clearer picture of voting trends and rates of change in each governorate. We found that governorates with high participation rates also had high rates of invalid votes, which means the participation trend was negative, contrary to what many media organizations emphasized. This new dimension of the electoral scene emerged from looking into the details hidden in the data.
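The comparison described above comes down to three rates per governorate. The sketch below shows one way to compute them and to pick out governorates that are high on both participation and invalid votes; the governorate names, vote counts and cutoffs are all invented, and the field names are hypothetical rather than taken from the National Elections Authority tables.

```python
def rates(row):
    """Participation, abstention and invalid-vote rates for one governorate."""
    participation = row["voted"] / row["registered"]
    return {
        "participation": participation,
        "abstention": 1 - participation,
        "invalid": row["invalid"] / row["voted"],  # share of ballots cast
    }


def high_on_both(results, min_participation, min_invalid):
    """Governorates whose participation and invalid rates both pass the cutoffs."""
    names = []
    for name, row in results.items():
        r = rates(row)
        if r["participation"] >= min_participation and r["invalid"] >= min_invalid:
            names.append(name)
    return names


# Invented figures, for illustration only.
results = {
    "Governorate A": {"registered": 1000, "voted": 600, "invalid": 90},
    "Governorate B": {"registered": 1000, "voted": 300, "invalid": 15},
}
print(rates(results["Governorate A"]))
print(high_on_both(results, min_participation=0.5, min_invalid=0.1))
```

Here the invalid rate is measured against ballots cast; measuring it against registered voters instead is an equally defensible choice and would change the cutoffs.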
Our readers could see the change in trend by comparing the maps of votes for each governorate across electoral processes, which show both participation rates and the number of invalid votes.
Again we relied on the free tool Flourish, which can display data in several interactive forms. This interactive visual display let us and our audience grasp, more easily and quickly, many details about abstention trends, and it added a new dimension to the voting picture. For example, some governorates that had not recorded high participation rates in previous processes recorded a remarkable increase in subsequent ones; on the other hand, some governorates that recorded high participation rates also recorded high rates of invalid votes.