Recently, there has been a large amount of news published regarding the ongoing conflicts in the Central African Republic and South Sudan. Jay Ulfelder has written fairly extensively about the mass killings that are ongoing in these states. Like Jay, I’m interested, generally, in the prediction of various “events-of-interest,” which is a heading that mass atrocities fall under. I’m also rather involved in using and evaluating the GDELT dataset for various purposes. So, when it became obvious that the conflicts in the Central African Republic and South Sudan had taken a turn for the worse, I wondered if it would have been possible to predict these events in advance. This also begins to fit nicely with conversations I’ve had on Twitter and elsewhere about the goals, evaluation, and implications of the various forecasts of politically-relevant events. I’ll have more to say in a later post about these issues, but I hope to start shedding a bit of light with an examination of prediction and data availability in GDELT.
The Signal and The Noise
One of the first things that I tell people about GDELT is that it is messy and noisy. This is not meant to be a dig at the dataset. This is just a reality of dealing with data on this scale that is generated by automated processes with little to no human supervision. One resulting aspect of this noisiness is a potentially high rate of false positives. Mike Ward and crew recently authored a blog post comparing GDELT data against ICEWS data. One conclusion of the post is that there are a high number of false positives contained in GDELT. The flip side of this, however, is that GDELT likely has a lower number of false negatives than other datasets; GDELT likely captures more events that actually happen.
Egypt Is Burning
Kalev Leetaru pointed out to me that the recent events in Egypt would provide another opportunity to visualize the GDELT data and how it tracks the protests. In addition, GDELT also provides the opportunity to track the government response to the protests. My previous maps only examined the presence of protest behavior across the world. For this set of maps, however, I focus instead on Egypt alone and include various government responses to the protest activity.
As always, the data is pulled from the GDELT dataset. The data runs from August 9th until the 17th, which I’ve used to create two separate maps. With each of these maps, in the upper right corner you can choose the layers that appear on the map. When both layers appear on the map, the clickable points show the information only for the protest events. If you disable the protest layer, you can see information for the additional violence or posture levels. Each point includes a record of the number of events, the location name, and the source URLs used to generate the events. As a note, CartoDB doesn’t play nicely with the long string of source URLs, so the URLs are separated only by spaces and must be copied and pasted into a browser individually. Finally, zooming in on the map is highly recommended; the GDELT dataset provides a fair amount of city-level detail in the geolocation of events and it is very interesting to see the spread of protests and government responses.
The first map displays protest behavior, subsetted from GDELT using root CAMEO code 14, with an overlay of violent events as red circles. The violent events are pulled using the root CAMEO code 18, with the additional stipulation that the target of the action had to be civilian in some way, as indicated by a CVL label in one of the Actor2Type codes. This generates the following map:
The second map uses the same protest data, but instead overlays CAMEO root code 15, which indicates a change in military or police posture. This results in the following map:
I leave the substantive interpretation up to the reader, but it is worthwhile to note that the change in military/police posture events are more widespread than the violent events. As expected, there is also a high concentration of violent events within Cairo.
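The subsetting behind these maps can be sketched in Python. This is a toy illustration rather than the actual extraction code: the real GDELT files have dozens of tab-delimited columns, so the field list and sample rows below are assumptions for illustration only.

```python
import csv

# Hypothetical tab-delimited GDELT-style rows. The real files have far more
# columns; these field names and positions are assumptions for illustration.
FIELDS = ["Day", "Actor2Type1Code", "EventRootCode", "Lat", "Lon"]

def subset_events(lines):
    """Keep protests (root code 14) and violence against civilians
    (root code 18 with a CVL actor type code)."""
    protests, violence = [], []
    for row in csv.reader(lines, delimiter="\t"):
        rec = dict(zip(FIELDS, row))
        if rec["EventRootCode"] == "14":
            protests.append(rec)
        elif rec["EventRootCode"] == "18" and rec["Actor2Type1Code"] == "CVL":
            violence.append(rec)
    return protests, violence

sample = [
    "20130809\tCVL\t14\t30.05\t31.25",   # protest event
    "20130810\tCVL\t18\t30.05\t31.25",   # violence against civilians
    "20130811\tGOV\t18\t30.05\t31.25",   # violence, non-civilian target
]
protests, violence = subset_events(sample)
```

The same pattern extends to the posture map by matching root code 15 instead.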
One of the primary shortcomings of the original protest map I posted was that it only captured a static picture of protest activity over a 6 month span. While this was interesting in its own right, many people, including myself, were interested in how protest behavior changes over time. Given this, I decided to explore creating an animated version of the original protest map.
As I’ve mentioned before, the GDELT data covers 1979 until the present day, with continuous daily updates. I’m making use of a subset that runs through June of 2013. The same caveats about the data that I noted in the previous post about the protest map still apply. When dealing with the time series of data, however, one additional, and very important, point also applies. The number of events recorded in GDELT grows exponentially over time, as noted in the paper introducing the dataset. This means that over time there appears to be a steady increase in events, but this should not be mistaken for a rise in the actual amount of behavior X (protest behavior in this case). Instead, due to changes in reporting and the digital recording of news stories, there are simply more events of every type over time. In some preliminary work that is not yet publicly released, protest behavior seems to remain relatively constant over time as a percentage of the total number of events. This means that while there was an explosion of protest activity in the Middle East, and elsewhere, during the past few years, identifying visible patterns is a tricky endeavor due to the nature of the underlying data.
Finally, the data used to create the map can be viewed here. In order to reduce the data to a manageable size, only locations where 10 or more events occurred within a specific month are included. 10 events is an admittedly arbitrary cutoff, but given that the highest number of events for one location in a given month is 3,746 (Cairo in February of 2011) I feel that it is reasonable.
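The cutoff step can be sketched with the standard library. The records below are made up; the real data groups GDELT events by location and month before applying the 10-event threshold.

```python
from collections import Counter

# Toy (location, month) event records; counts are invented for illustration.
# The 10-event cutoff mirrors the one used to thin the map data.
events = [("Cairo", "2011-02")] * 12 + [("Luxor", "2011-02")] * 3

counts = Counter(events)
kept = {key: n for key, n in counts.items() if n >= 10}
```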
With these notes out of the way, the map can be viewed at johnbeieler.org/protest_mapping. The color of each point is scaled by the number of events, with darker circles indicating more events.
There’s been a fair amount of discussion in the news media recently about whether the start of Ramadan will lead to a decrease in protest behavior in locations such as Egypt and Turkey. While I don’t have a forecast to offer, I do have data from GDELT that tracks protest activity dating back to 1979. With the help of Wikipedia I was able to identify the beginning of Ramadan going back to 1979. Using these dates, it is then possible to plot the time series of protest behavior with the Ramadan time period layered on top. It is important to note that this is a rather quick and dirty analysis. For example, in 1981 Ramadan began on July 3, so I marked both July and August as Ramadan months. This shouldn’t affect the visualization all that much, but it seemed worth noting. Taken together, this data produces the following plot (larger version available here):
A quick eyeball test seems to indicate that Ramadan often coincides with a slightly lower level of protest activity over time. In order to shore this up, a quick correlation of “Ramadan” with the protest data shows a coefficient of about -0.025, which indicates a very weak, negative relationship between Ramadan and protest behavior.
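The correlation is just a Pearson coefficient between a binary Ramadan indicator and the monthly protest series. A minimal sketch, with made-up numbers standing in for the real GDELT series:

```python
# Toy monthly protest percentages and a Ramadan dummy (1 = Ramadan month).
# Both series are invented for illustration; the real ones come from GDELT
# and the Wikipedia-derived Ramadan dates.
protest_pct = [2.1, 2.3, 1.9, 2.0, 2.2, 1.8]
ramadan =     [0,   0,   1,   0,   0,   1]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

r = pearson(ramadan, protest_pct)  # negative in this toy example
```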
Caveats, notes, etc.
With this said, it is important to note a few things about the data. First, this includes all protest activity over time. Events such as the US protesting the behavior of another state are included along with civilian protests. This data also includes protest data from the entire globe; it is quite possible that the effects of Ramadan are more pronounced in countries with a higher Muslim population. A final issue is the possible presence of seasonality in the data. Ramadan just might happen to coincide with time periods that experience lower protest activity, e.g., the summer months.
Since I complained on Twitter that the creators of time-series plots often fail to upload the data used to create the time series, the protest data is available here, while the dates for Ramadan I used are here. As a final note on the data, the values for the protests are percentages of the total events that occurred in a given month. This controls for the fact that later years, particularly those after 2005 or so, have more events due to changes in reporting, etc.
Edit July 17, 2013:
Since this map got picked up by the Guardian, I thought I would clarify some points further for those who aren’t familiar with the data.
First, the GDELT data is based on news reports from a variety of sources (a list of sources used can be found here under “Data Sources”). For better or for worse, journalistic accounts of events are about the best we can do for large-scale, global projects such as this. Second, if an event occurs but does not have a specific location within a country, e.g., “Protestors in Syria…”, the event is geolocated to the centroid of the country. This means that there may be some odd points with a high number of events at country centroids. Third, the “Event Count” featured on the map is the number of protest events that occurred at that location for the entire first half of 2013. This means that if the “Event Count” variable shows 60, then there were 60 unique protest events at that location. This is not a measure of the scale or intensity of a given protest, nor of how many times a certain protest was mentioned in the news media (though GDELT does record this); it is simply a count of unique events. Next, geolocation is hard, especially on the scale GDELT works at (300+ million events spanning over 30 years), so some of the points may not be perfect. Even if 10 million events are located in the wrong place, however, that’s still an error rate of only about 3%. Finally, and this is mentioned in the post further below, GDELT uses the CAMEO coding scheme to classify events. This means that many different types of protest behavior are recorded, not just the protests or riots that come to mind when one thinks of Egypt or Turkey. Russia verbally protesting actions of the United States is a protest event. This means that there are both a higher number of events, and events in locations that a person might not tie to a protest or riot.
GDELT and Protests
Given the recent spate of protests around the world, there was some discussion between Jay Ulfelder, Patrick Brandt, Kalev Leetaru, Phil Schrodt, and myself about the possibility of using GDELT to examine some of the protest activity. Much of this still remains in the discussion stage, but some data was pulled from GDELT, and I decided to venture into the world of map making. As a caveat, I’ve never really worked with geographic visualization of data, and this is my first cut at this type of work. So, without further ado, the map is located at http://cdb.io/14RHla0.
The data used to create the map contains all CAMEO codes that begin with 14, which is the general category for “protest” events, for the year 2013. Including data from earlier than 2013 made the map much too cluttered. A potential issue with the use of this CAMEO category is that it picks up governments protesting other governments, politicians protesting policies, etc. This is why the U.S. is blank; it was a shining beacon of protest activity that distracted from the other parts of the map. If anyone is interested I can put the U.S. data back in and regenerate the map. The data was grouped by the latitude and longitude coordinates, and a count of protest events at each location is included. If you zoom in, you are able to see the individual points, which when clicked provide information about the location and number of events.
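The grouping step can be sketched with the standard library. The rows below are invented, and the country/coordinate fields are stand-ins for GDELT’s actual columns; the U.S. exclusion mirrors the choice described above.

```python
from collections import Counter

# Toy protest events as (country, lat, lon) tuples; values are made up.
events = [
    ("EGY", 30.05, 31.25),
    ("EGY", 30.05, 31.25),
    ("TUR", 41.01, 28.98),
    ("USA", 38.90, -77.04),
]

# Group by coordinates, dropping U.S. events as was done for the map.
counts = Counter((lat, lon) for country, lat, lon in events
                 if country != "USA")
```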
The main takeaway from this map seems to be that GDELT does a pretty good job of capturing the broad trends of protest activity; the areas that are “bright” are those that would generally be expected to be so.
Note on Tools
Processing Data With Hive
Working with big data is not a particularly easy task. There is a lot of commentary on the web about what constitutes “big” data. The GDELT dataset, which is the focus of this post, is over 40 gigabytes uncompressed, so this is not a discussion of “Google-size” data. It is, however, more than most social scientists are used to dealing with in one pass. I’ve attempted to chronicle my history of working with the GDELT dataset to draw interesting conclusions about the world using event data. I’ve been relatively successful so far, but I felt that it was possible to make the data easier to work with. Towards this end, I began to explore SQL and database technologies to use as a subsetting method. I finally landed on Hive, which is part of the Hadoop ecosystem. Hive allows you to run SQL queries (Hive’s language is actually called HiveQL) on top of the map/reduce framework for computation. The data is distributed (mapped) across multiple nodes in a server cluster, queries run in parallel on each partition of the data, and the partial results are then recombined (reduced) into a single output.
Using Hive, it is possible to run fairly complex queries across the entirety of the GDELT dataset in roughly five minutes. This speed is possible thanks to Amazon’s Elastic MapReduce environment, which makes use of Elastic Cloud Compute (EC2) resources as the computational backend. EC2 makes it cheap and easy to rent a large cluster of servers; as an example, I have used 40 servers at ~$0.10/hr per server. Thus, the combination of Hive and Amazon Web Services makes it remarkably easy to get up and running with quick queries over this very interesting dataset. The rest of this post shows you how.
Easier subsetting of data
Yes, another post about GDELT. But this one can apply to other datasets, too.
In an earlier post, I wrote about how to start subsetting the GDELT data using Python. Others also wrote similar pieces. Each of these posts used the same basic idea: iterate over each line of the dataset, split the line based on tabs, and select the lines that have fields that match some criteria. This was all well and good, especially when working with the reduced dataset. The release of the full GDELT data, however, complicates matters somewhat. Whereas the reduced dataset only has 11 fields of data, the full dataset contains 56 or 57 fields, depending on which set of the full data is under examination. On top of this, I have noticed that when writing more complex subsetting scripts it is often easy to lose track of the rules for selection. These rules are also obfuscated in the Python code for splitting and selecting. What was field 35 again? Suffice it to say that I have become tired of writing subsetting scripts. A second development is my growing use of SQL resources, including SQLite and Hive for Hadoop. I have found that these resources make parsing data much easier, and I will have more to say about these technologies, specifically Hive and Hadoop, in a later post as some projects I am working on develop further. But, currently, it is possible to make use of SQL queries while still remaining in the Python ecosystem and making use of fantastic libraries such as pandas, all while avoiding the actual setup of a SQL database.
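One way to get SQL queries without standing up a database server is an in-memory SQLite table, sketched here with toy rows. The table layout is invented for illustration; the point is that selection rules become readable, named-column SQL instead of positional splits (and `pandas.read_sql` can consume the same query if a DataFrame is wanted).

```python
import sqlite3

# Toy GDELT-like rows: (day, CAMEO root code, country). Invented for
# illustration; real rows would carry the full set of fields.
rows = [
    ("20130809", "14", "EGY"),
    ("20130810", "18", "EGY"),
    ("20130811", "14", "TUR"),
]

# An in-memory database: no server, no files, nothing to set up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (day TEXT, root_code TEXT, country TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

# The selection rule is now explicit SQL rather than "what was field 35?"
protests = conn.execute(
    "SELECT day, country FROM events WHERE root_code = '14'"
).fetchall()
```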
Yesterday, I started downloading the GDELT data from the website, having previously pulled the data from the servers at Penn State. Having to navigate the website and download each file individually caused me far more frustration than it should have, not due to the design of the website but due to my own impatience. I’ll have more to say about the general state of data distribution in the social sciences in another post, but for now it’s enough to say that I’m not a fan of downloading data by hand. Because of this, I wrote some scripts in Python to help with downloading the data. There are two scripts, which download either the historical files or the continuously-updated daily files. The code is on github. I’ve copied the contents of the README file below so you can determine if the scripts would be useful for you.
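The core of the daily-download script is just generating one URL per day and fetching it. A minimal sketch: the base URL and file-naming pattern here are assumptions for illustration, not the actual locations, which are encoded in the scripts on github.

```python
from datetime import date, timedelta

# BASE_URL and the filename pattern are hypothetical placeholders.
BASE_URL = "http://example.com/gdelt/"

def daily_urls(start, end):
    """Yield one zipped-CSV URL per day in the inclusive range [start, end]."""
    day = start
    while day <= end:
        yield BASE_URL + day.strftime("%Y%m%d") + ".export.CSV.zip"
        day += timedelta(days=1)

urls = list(daily_urls(date(2013, 4, 1), date(2013, 4, 3)))
# Each URL could then be fetched, e.g. with urllib.request.urlretrieve.
```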
This semester I’m taking a class on terrorism. Overall I’ve found the class very enjoyable; the topic of political violence is one that is always fascinating. With that said, I found one issue that would repeatedly pop up in the readings for the seminar. Almost every paper would lay out some type of theory, discuss the data used, and run some statistical tests, which is pretty standard social science research. The issue arose in the “Discussion” or “Conclusion” sections. Almost invariably the authors would discuss the practical implications of their research, which is fine until the dreaded prediction word appears. Then claims about the predictive accuracy of the models used in the paper would rear their ugly heads, even though these models were explicitly not predictive models. This became the soap box that I would drag out repeatedly throughout the semester. Finally, I decided to put my own assertions to the test and see how some of these models performed on out-of-sample predictive tests.