Prediction and Data Availability

The ongoing conflicts in the Central African Republic and South Sudan have received a great deal of news coverage recently, and Jay Ulfelder has written fairly extensively about the mass killings taking place in both states. Like Jay, I’m interested in predicting various “events-of-interest,” a heading that mass atrocities fall under. I’m also rather involved in using and evaluating the GDELT dataset for various purposes. So, when it became obvious that the conflicts in the Central African Republic and South Sudan had taken a turn for the worse, I wondered whether it would have been possible to predict these events in advance. This question also fits nicely with conversations I’ve had on Twitter and elsewhere about the goals, evaluation, and implications of forecasts of politically relevant events. I’ll have more to say about these issues in a later post, but I hope to start shedding a bit of light with an examination of prediction and data availability in GDELT.

Event Prediction and GDELT

While much of the discussion of prediction centers on the algorithms and methods used to make forecasts, a much more salient issue for forecasters of political events is data availability. It’s fairly well known that political data is hard to get, but many are hopeful that GDELT will change this, since it is created through a fully automated process, is updated daily, and has broader coverage than probably any other political event dataset. GDELT is reliant on media coverage, however, which creates a specific set of issues beyond the infrequent data updates that human-curated projects experience. Given this, my first step in asking whether it was possible to predict events in the Central African Republic (CAR) and South Sudan (SSD) was to examine the data availability for these two countries. The results were a bit disappointing. The plots below show the daily counts of the four major categories of events that are often used in this type of work: verbal cooperation, material cooperation, verbal conflict, and material conflict. The first plot shows data for the Central African Republic over the entire date range that GDELT covers, while the second shows a truncated range that takes into account the sparsity of data in the earlier years of GDELT’s coverage.
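For readers who want to tally counts like these themselves, here is a minimal sketch of the bookkeeping, not the code used for this post. It assumes the events have already been filtered to a single country and reduced to pairs of GDELT’s SQLDATE (a YYYYMMDD integer) and QuadClass (1 through 4) fields. The function name, the pair format, and the sample records are all hypothetical.

```python
from collections import Counter
from datetime import date, timedelta

# GDELT's QuadClass field codes the four categories discussed above.
QUAD_CLASSES = {
    1: "verbal cooperation",
    2: "material cooperation",
    3: "verbal conflict",
    4: "material conflict",
}


def daily_quad_counts(events, start, end):
    """Return {day: {category: count}} for every day from start to end, inclusive.

    `events` is an iterable of (sqldate, quad_class) pairs, where sqldate is a
    YYYYMMDD integer. Days with no events are kept with zero counts so that
    gaps in coverage stay visible instead of silently disappearing.
    """
    tally = Counter()
    for sqldate, quad in events:
        day = date(sqldate // 10000, sqldate // 100 % 100, sqldate % 100)
        if start <= day <= end and quad in QUAD_CLASSES:
            tally[(day, quad)] += 1

    counts = {}
    day = start
    while day <= end:
        counts[day] = {name: tally[(day, q)] for q, name in QUAD_CLASSES.items()}
        day += timedelta(days=1)
    return counts


# Hypothetical records for illustration only, not real GDELT data.
sample_events = [(20131205, 4), (20131205, 4), (20131205, 3), (20131207, 1)]
sample_counts = daily_quad_counts(sample_events, date(2013, 12, 5), date(2013, 12, 7))
for day, row in sample_counts.items():
    print(day.isoformat(), row)
```

Keeping the empty days in the output is a deliberate choice: for sparsely covered countries, the days with nothing recorded are a large part of the story.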

These plots show the relatively weak coverage that even a comprehensive dataset such as GDELT has for more remote countries such as the Central African Republic. This isn’t a problem that is easily overcome with either news media or crowd-sourced data, as Jay Ulfelder points out in a great blog post. The story is similar for South Sudan, as the figures below show.

Given this, what worries me about forecasting political events is not so much the accuracy of the algorithms or the things that we forecast (when the appropriate conditions exist, we do fairly well). My main worry is data availability; you can’t forecast something for which data doesn’t exist. It’s an interesting conundrum, since the places that are most “interesting” in terms of forecasting are also those that tend to have poor coverage, and thus poor data. I don’t have an answer for this issue, and I don’t really think an easy one exists. I do believe, however, that this is a major hurdle for predictive analytics in the social sciences.

Edit (12/31/13):

Erin Simpson noted that it might not be immediately obvious how little data actually exists for these two countries if one isn’t familiar with data from other countries. I believe this is a perfectly valid point, so the two figures below show the data for Kenya, another country that doesn’t have particularly good coverage, and Egypt, a country that has much better coverage than any of the others mentioned in this post. The plots show the data from 2011 onwards, and the data used is available here.
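One rough way to put a number on this kind of comparison is to compute, for each country, the average number of events per day and the share of days with at least one recorded event. The sketch below is only an illustration under assumed inputs: the function name, the two summary measures, and the “sparse” and “dense” sample series are hypothetical and are not drawn from GDELT.

```python
from datetime import date


def coverage_summary(event_days, start, end):
    """Summarize how densely the period from start to end (inclusive) is covered.

    `event_days` is an iterable of date objects, one entry per recorded event.
    """
    total_days = (end - start).days + 1
    in_range = [d for d in event_days if start <= d <= end]
    return {
        "events_per_day": len(in_range) / total_days,
        "share_of_days_with_events": len(set(in_range)) / total_days,
    }


# Hypothetical four-day window with made-up event dates for two countries.
start, end = date(2013, 12, 1), date(2013, 12, 4)
sparse = [date(2013, 12, 1), date(2013, 12, 1), date(2013, 12, 4)]
dense = [
    date(2013, 12, 1), date(2013, 12, 1), date(2013, 12, 2),
    date(2013, 12, 3), date(2013, 12, 3), date(2013, 12, 4),
]

sparse_summary = coverage_summary(sparse, start, end)
dense_summary = coverage_summary(dense, start, end)
print("sparse:", sparse_summary)
print("dense:", dense_summary)
```

The share of days with any events is arguably the more telling of the two measures for forecasting, since a handful of heavily reported days can inflate the average while most of the calendar stays empty.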

Data

The data used to make the figures in this post is available here.

John Beieler

PhD Student in Political Science
