GDELT
Over the past week, the Global Data on Events, Location and Tone (GDELT) dataset was finally released to the general public. The data is available at the Penn State event data website. We at Penn State had the good fortune to have access to this dataset for many months before its public release, which allowed us to gain some experience working with this massive collection of data.
As a brief background, GDELT consists of event data records spanning 1979 to mid-2012. The events are coded according to the CAMEO coding scheme, with the addition of a “QuadCategory,” which separates the events into the material conflict, material cooperation, verbal conflict, and verbal cooperation categories. The data is spread across 33 different files, each of which is large on its own. This makes the data fairly difficult to work with, and almost guarantees that analysis will have to be performed on some subset of the data. Phil Schrodt has included some programs with the data to aid in this subsetting, but I thought there might be some who would prefer to get their hands dirty and write some of their own code. Given this, I thought I would share some of the knowledge I gained while working with the GDELT dataset.
For the purposes of this brief introduction, I will work under the assumption
that the desired events are those that originate from the United States, are
directed at some type of state actor, and are either verbal cooperation or
verbal conflict. The following code, written in Python, demonstrates how this subset
might be selected from the GDELT data. The code also assumes the reader
has the pandas and path Python modules installed. Both can be installed using
the normal pip install method. Finally, the complete code is available as
a gist on GitHub.
Before starting, it is always useful to take a peek at the data. This is as
simple as opening up the terminal and using head 1979.reduced.txt. Doing this
shows the various columns in the reduced data and how they are arranged. We
can see that the date is in the 0 index, with Actor1Code and Actor2Code
in spots 1 and 2, respectively. Additionally, EventCode is located in spot 3,
while the QuadCategory variable is, fittingly, in position 4. These indices
will prove crucial when it comes time to split and subset the data.
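The same peek can be done from Python. The sketch below is only an illustration: the header string is a hypothetical stand-in limited to the five columns described above (the real file has more columns after these), and `column_index` is a name I made up for the lookup.

```python
# Hypothetical header line containing only the five columns discussed above;
# the real reduced files carry additional columns after these.
header_line = "Day\tActor1Code\tActor2Code\tEventCode\tQuadCategory\n"

# Strip the trailing newline and split on tabs, as will be done for each row.
columns = header_line.rstrip("\n").split("\t")

# Map each column name to its position so the indices need not be memorized.
column_index = {name: position for position, name in enumerate(columns)}

for name, position in column_index.items():
    print(position, name)
```

With a real file, `header_line` would instead come from reading the first line of something like 1979.reduced.txt.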
Code
This first portion is simply importing the necessary modules, defining the list
of state actors, obtaining the relevant filepaths, and defining an empty list,
which will serve as the container for the subset of the data.
As a brief note, the path module is really fantastic and makes working with
filepaths and directories extremely simple.
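A minimal sketch of this setup step might look like the following. It is not the original gist: it swaps the standard-library `pathlib` in for the path module so that it runs without extra installs, the actor list is a short illustrative sample rather than the full list, and it assumes the reduced files sit in the current working directory.

```python
from pathlib import Path

# Abbreviated, illustrative list of three-letter country codes. The real list
# would cover every state actor of interest.
all_actors = ["AFG", "CHN", "GBR", "IRN", "RUS"]

# Every reduced GDELT file in the current working directory, in sorted order.
# (pathlib is used here in place of the path module mentioned in the text.)
filepaths = sorted(Path.cwd().glob("*.reduced.txt"))

# Container for the rows that survive the subsetting step.
output = []

print("Found %d reduced files" % len(filepaths))
```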
The next portion of code iterates over the filepaths obtained using path.
Each file is then opened and iterated over line by line. Each line
has any newline characters removed, which makes the data easier to work with,
and is then split on tab characters ('\t'). Four logical conditions
then define the subset. condition1
indicates that the first three characters of Actor1Code should be
'USA', while condition2 states that the first three characters of
Actor2Code should not equal 'USA'. condition3 checks if the first three
characters of Actor2Code are in the allActors list defined earlier, while
condition4 checks if the QuadCategory is one of the desired values.
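To make the four conditions concrete, here is a small sketch that applies them to a single made-up line. The sample row, the abbreviated actor set, and the helper name `check_conditions` are all hypothetical, and the QuadCategory codes '2' and '3' are my assumption for the two verbal categories, so check them against the GDELT documentation before relying on them.

```python
# Abbreviated, illustrative actor set (the real list is much longer).
ALL_ACTORS = {"CHN", "GBR", "RUS"}

# ASSUMPTION: '2' and '3' stand for the verbal cooperation and verbal
# conflict QuadCategory values. Verify against the GDELT codebook.
VERBAL_QUADS = {"2", "3"}


def check_conditions(line):
    """Split one tab-delimited line and evaluate the four conditions."""
    split_line = line.replace("\n", "").split("\t")
    condition1 = split_line[1][0:3] == "USA"        # source actor is the USA
    condition2 = split_line[2][0:3] != "USA"        # target actor is not the USA
    condition3 = split_line[2][0:3] in ALL_ACTORS   # target is a listed state
    condition4 = split_line[4] in VERBAL_QUADS      # verbal event category
    return split_line, [condition1, condition2, condition3, condition4]


# Made-up row: Day, Actor1Code, Actor2Code, EventCode, QuadCategory.
sample = "19790101\tUSAGOV\tRUS\t040\t2\n"
split_line, conditions = check_conditions(sample)
print(split_line)
print(conditions)
```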
The final step inside the loop checks whether all of the conditions were met and,
if so, appends the split_line to output. This check is wrapped in a try-except
statement because there can be some malformed lines floating in the data; skipping
them should not affect the actual event data. A try-except statement
attempts a certain block of code and, if an error is raised (in this case
an IndexError), performs some other action instead.
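Putting the pieces together, the loop body could be sketched as below. This is an illustration under stated assumptions rather than the original code: the lines are made-up, in-memory stand-ins for a file, the actor set is abbreviated, and the QuadCategory codes '2' and '3' are assumed values for the verbal categories. Note that the indexing happens inside the try block, so a short line raises the IndexError where it can be caught.

```python
ALL_ACTORS = {"CHN", "GBR", "RUS"}   # abbreviated, illustrative
VERBAL_QUADS = {"2", "3"}            # ASSUMED codes for the verbal categories

# Made-up lines standing in for the contents of one reduced file.
lines = [
    "19790101\tUSAGOV\tRUS\t040\t2\n",  # kept: USA -> RUS, verbal
    "19790101\tUSA\tUSAMIL\t040\t2\n",  # dropped: target is also the USA
    "19790102\tUSA\n",                  # malformed: too few fields
    "19790103\tUSA\tCHN\t190\t4\n",     # dropped: not a verbal category
]

output = []
for line in lines:
    split_line = line.rstrip("\n").split("\t")
    try:
        keep = all([
            split_line[1][0:3] == "USA",
            split_line[2][0:3] != "USA",
            split_line[2][0:3] in ALL_ACTORS,
            split_line[4] in VERBAL_QUADS,
        ])
    except IndexError:
        # Malformed line with missing fields: skip it and move on.
        continue
    if keep:
        output.append(split_line)

print(output)
```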
Once the code from the previous sections is finished, the subset of the data is
contained in output, which is a list of lists, with the inner lists representing
the individual events. Using the pandas library, it is possible to convert this
list of lists to a pandas DataFrame object, with the header drawn from the first line
of the first file in filepaths. In order to aggregate the data to a specific time period,
it is useful to break out the Day variable into months and years using the .map
functionality of a Series object in pandas. The .map functionality is combined
with a lambda function to select each observation in the series and slice it
in order to obtain the year and month. Finally, the data is reduced
to only the columns relevant for this subset.
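As a rough sketch of this conversion, assuming the Day values are YYYYMMDD strings: the header here is hard-coded to the five columns discussed earlier instead of being read from a file, the two rows are made up, and the `year` and `month` column names are my own choice.

```python
import pandas as pd

# Hard-coded header for illustration; in practice it would be read from the
# first line of the first file.
header = ["Day", "Actor1Code", "Actor2Code", "EventCode", "QuadCategory"]

# Made-up rows standing in for the list of lists built by the subsetting loop.
output = [
    ["19790101", "USAGOV", "RUS", "040", "2"],
    ["19790215", "USA", "CHN", "112", "3"],
]

subset = pd.DataFrame(output, columns=header)

# Slice the YYYYMMDD string into its year and month parts.
subset["year"] = subset["Day"].map(lambda day: int(day[0:4]))
subset["month"] = subset["Day"].map(lambda day: int(day[4:6]))

# Keep only the columns needed for the aggregation.
subset = subset[["year", "month", "Actor1Code", "Actor2Code", "QuadCategory"]]
print(subset)
```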
Now that the data is in a properly formatted DataFrame, the next step is to create
variables from which to draw counts of the various event types. Two variables,
verbal_coop and verbal_conf, are created and assigned values of zero. Then,
these variables are assigned values of one if the QuadCategory matches the
value for that event type. This functionality in pandas is similar to that of
R, and I plan on doing a more in-depth tutorial on pandas at a later date.
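One way to write this indicator step is with boolean indexing through `.loc`, sketched below on a made-up frame. As before, the QuadCategory codes '2' (verbal cooperation) and '3' (verbal conflict) are assumptions to be checked against the GDELT documentation.

```python
import pandas as pd

# Made-up frame standing in for the subset built in the previous step.
subset = pd.DataFrame({
    "Actor2Code": ["RUS", "CHN", "RUS"],
    "QuadCategory": ["2", "3", "2"],
})

# Start both indicators at zero for every row.
subset["verbal_coop"] = 0
subset["verbal_conf"] = 0

# ASSUMED codes: '2' = verbal cooperation, '3' = verbal conflict.
subset.loc[subset["QuadCategory"] == "2", "verbal_coop"] = 1
subset.loc[subset["QuadCategory"] == "3", "verbal_conf"] = 1

print(subset)
```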
With the event variables created, the data can be grouped and aggregated. pandas
has the .groupby functionality for DataFrames, which allows you to group the data
by specific variables. For the purposes of this dataset, a multi-level grouping is
desired, with groups created by the year, the month, and the dyad. Once this grouping
is created, the values are summed within each grouping leading to the final, aggregated
dataset. This final data can be saved using the .to_csv method.
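The grouping and summing can be sketched as follows. The rows are made up, and treating the dyad as the pair of Actor1Code and Actor2Code columns is my assumption about how it is represented. Calling `.to_csv` without a path returns the CSV text, which keeps the sketch free of side effects; passing a filename writes to disk instead.

```python
import pandas as pd

# Made-up events with the indicator columns already filled in.
subset = pd.DataFrame({
    "year": [1979, 1979, 1979, 1980],
    "month": [1, 1, 2, 1],
    "Actor1Code": ["USA", "USA", "USA", "USA"],
    "Actor2Code": ["RUS", "RUS", "CHN", "RUS"],
    "verbal_coop": [1, 0, 1, 1],
    "verbal_conf": [0, 1, 0, 0],
})

# Group by year, month, and dyad, then sum the indicators within each group.
group_keys = ["year", "month", "Actor1Code", "Actor2Code"]
monthly = (
    subset.groupby(group_keys)[["verbal_coop", "verbal_conf"]]
    .sum()
    .reset_index()
)

# With no path argument, to_csv returns the CSV as a string.
csv_text = monthly.to_csv(index=False)
print(csv_text)
```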
And there you have it: you now have a subset of the GDELT dataset.
This is a relatively simple task thanks to Python’s built-in tools. It is
possible to make this task run more quickly using parallel processing, as
outlined in my previous post on parallel subsetting. As a brief recap, it is simply
a matter of using the joblib module, wrapping the above subsetting code in
a function, and adding something along the lines of
data = Parallel(n_jobs=-1)(delayed(subset)(x) for x in list_of_paths)
to the script.
I hope this brief intro will prove helpful to those who wish to use GDELT in their own research. This tutorial only scratched the surface of working with event data, and there is much more to consider beyond just what subset you will select. A good resource on working with event data was written by Jay Yonamine and will likely prove useful to those engaging in event-data research.