John Beieler

PhD Student in Political Science

How Do I GDELT?: Subsetting and Aggregating the GDELT Dataset

GDELT

Over the past week, the Global Database of Events, Language, and Tone (GDELT) dataset was finally released to the general public. The data is available at the Penn State event data website. We at Penn State had the good fortune of access to this dataset for many months before its public release, which allowed us to gain some experience working with this massive collection of data. As brief background, GDELT comprises event data records spanning 1979 through mid-2012. The events are coded according to the CAMEO coding scheme, with the addition of a “QuadCategory,” which separates the events into material conflict, material cooperation, verbal conflict, and verbal cooperation categories. The data is spread across 33 different files, each of which is substantially large on its own. This makes the data fairly difficult to work with and almost guarantees that some subset is necessary in order to perform analysis. Phil Schrodt has included some programs with the data to aid in this subsetting, but I thought there might be some who would prefer to get their hands dirty and write their own code. Given this, I thought I would share some of the knowledge I gained while working with the GDELT dataset.

For the purposes of this brief introduction, I will work under the assumption that the desired events are those that originate from the United States, are directed at some type of state actor, and are either verbal cooperation or conflict. The following code, written in Python, demonstrates how this subset might be selected from the GDELT data. The code also assumes the reader has the pandas and path Python modules installed. Both can be installed using the normal pip install method. Finally, the complete code is available as a gist on github.

Before starting, it is always useful to take a peek at the data. This is as simple as opening up the terminal and running head 1979.reduced.txt. Doing so shows the various columns in the reduced data and how they are arranged. We can see that the date is at index 0, with Actor1Code and Actor2Code in spots 1 and 2, respectively. Additionally, EventCode is located in spot 3, while the QuadCategory variable is, fittingly, in position 4. These indices will prove crucial when it comes time to split and subset the data.
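If you prefer to do that check from Python rather than the shell, a quick sketch such as the following works too. Note that it assumes the reduced files carry a header row naming the columns, the same assumption the subsetting code below makes.

# Quick look at the column layout from Python. This assumes the reduced
# file has a header row naming the columns.
with open('1979.reduced.txt', 'r') as infile:
    header = infile.readline().strip().split('\t')

for index, name in enumerate(header):
    print index, name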

Code

from path import path
import pandas as pd

allActors = ['AFG', 'ALA', 'ALB', 'DZA', 'ASM', 'AND', 'AGO', 'AIA', 'ATG',
            'ARG', 'ARM', 'ABW', 'AUS', 'AUT', 'AZE', 'BHS', 'BHR', 'BGD',
            'BRB', 'BLR', 'BEL', 'BLZ', 'BEN', 'BMU', 'BTN', 'BOL', 'BIH',
            'BWA', 'BRA', 'VGB', 'BRN', 'BGR', 'BFA', 'BDI', 'KHM', 'CMR',
            'CAN', 'CPV', 'CYM', 'CAF', 'TCD', 'CHL', 'CHN', 'COL', 'COM',
            'COD', 'COG', 'COK', 'CRI', 'CIV', 'HRV', 'CUB', 'CYP', 'CZE',
            'DNK', 'DJI', 'DMA', 'DOM', 'TMP', 'ECU', 'EGY', 'SLV', 'GNQ',
            'ERI', 'EST', 'ETH', 'FRO', 'FLK', 'FJI', 'FIN', 'FRA', 'GUF',
            'PYF', 'GAB', 'GMB', 'GEO', 'DEU', 'GHA', 'GIB', 'GRC', 'GRL',
            'GRD', 'GLP', 'GUM', 'GTM', 'GIN', 'GNB', 'GUY', 'HTI', 'VAT',
            'HND', 'HKG', 'HUN', 'ISL', 'IND', 'IDN', 'IRN', 'IRQ', 'IRL',
            'IMY', 'ISR', 'ITA', 'JAM', 'JPN', 'JOR', 'KAZ', 'KEN', 'KIR',
            'PRK', 'KOR', 'KWT', 'KGZ', 'LAO', 'LVA', 'LBN', 'LSO', 'LBR',
            'LBY', 'LIE', 'LTU', 'LUX', 'MAC', 'MKD', 'MDG', 'MWI', 'MYS',
            'MDV', 'MLI', 'MLT', 'MHL', 'MTQ', 'MRT', 'MUS', 'MYT', 'MEX',
            'FSM', 'MDA', 'MCO', 'MNG', 'MTN', 'MSR', 'MAR', 'MOZ', 'MMR',
            'NAM', 'NRU', 'NPL', 'NLD', 'ANT', 'NCL', 'NZL', 'NIC', 'NER',
            'NGA', 'NIU', 'NFK', 'MNP', 'NOR', 'PSE', 'OMN', 'PAK', 'PLW',
            'PAN', 'PNG', 'PRY', 'PER', 'PHL', 'PCN', 'POL', 'PRT', 'PRI',
            'QAT', 'REU', 'ROM', 'RUS', 'RWA', 'SHN', 'KNA', 'LCA', 'SPM',
            'VCT', 'WSM', 'SMR', 'STP', 'SAU', 'SEN', 'SRB', 'SYC', 'SLE',
            'SGP', 'SVK', 'SVN', 'SLB', 'SOM', 'ZAF', 'ESP', 'LKA', 'SDN',
            'SUR', 'SJM', 'SWZ', 'SWE', 'CHE', 'SYR', 'TJK', 'TZA', 'THA',
            'TGO', 'TKL', 'TON', 'TTO', 'TUN', 'TUR', 'TKM', 'TCA', 'TUV',
            'UGA', 'UKR', 'ARE', 'GBR', 'USA', 'VIR', 'URY', 'UZB', 'VUT',
            'VEN', 'VNM', 'WLF', 'ESH', 'YEM', 'ZMB', 'ZWE']

quad_codes = ['2', '3']

filepaths = path.getcwd().files('*.reduced.txt')
output = list()

This first portion is simply importing the necessary modules, defining the list of state actors, obtaining the relevant filepaths, and defining an empty list, which will serve as the container for the subset of the data. As a brief note, the path module is really fantastic and makes working with filepaths and directories extremely simple.
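As an aside, if you would rather stick to the standard library, a rough equivalent of the path call above can be written with glob; this sketch assumes the reduced files live in the current working directory.

import glob
import os

# Rough standard-library equivalent of path.getcwd().files('*.reduced.txt'):
# collect the reduced GDELT files sitting in the current working directory.
filepaths = sorted(glob.glob(os.path.join(os.getcwd(), '*.reduced.txt')))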

for filepath in filepaths:
    data = open(filepath, 'r')
    print 'Just read in the %s data...' % filepath
    for line in data:
        line = line.replace('\n', '')
        split_line = line.split('\t')
        try:
            condition1 = split_line[1][0:3] == 'USA'       # Actor1 is a United States actor
            condition2 = split_line[2][0:3] != 'USA'       # Actor2 is not a United States actor
            condition3 = split_line[2][0:3] in allActors   # Actor2 is a state actor
            condition4 = split_line[4] in quad_codes       # QuadCategory is verbal coop or conflict

The next portion of code iterates over the filepaths obtained using path. Each file is opened and iterated over line by line. Each line has any newline characters removed, which makes the data easier to work with, and is then split on tab characters (‘\t’). The final four lines define the logical conditions for subsetting the data, and they sit inside a try block so that malformed lines can be skipped, as explained below. condition1 states that the first three characters of Actor1Code should be ‘USA’, while condition2 states that the first three characters of Actor2Code should not equal ‘USA’. condition3 checks whether the first three characters of Actor2Code are in the allActors list defined earlier, while condition4 checks whether the QuadCategory is one of the desired values.

            if all([condition1, condition2, condition3, condition4]):
                output.append(split_line)
        except IndexError:
            pass

The above code completes the try-except statement: if all of the conditions are met, split_line is appended to output. The try-except is necessary because there can be some malformed lines floating around in the data; when one of these short lines raises an IndexError, it is simply skipped, which does not affect the actual event data. More generally, a try-except statement allows a block of code to be attempted and, if a particular error is raised, in this case an IndexError, some other action to be taken instead.
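To make both the conditions and the error handling concrete, here are two purely made-up records, not actual GDELT lines, run through the same logic; the snippet relies on the allActors and quad_codes lists defined earlier.

# Made-up records used only for illustration; the fields are Day, Actor1Code,
# Actor2Code, EventCode, and QuadCategory.
good_line = '19790104\tUSAGOV\tRUS\t042\t2'
fields = good_line.split('\t')

print fields[1][0:3] == 'USA'      # condition1: True
print fields[2][0:3] != 'USA'      # condition2: True
print fields[2][0:3] in allActors  # condition3: True, 'RUS' is a state actor
print fields[4] in quad_codes      # condition4: True, '2' is a desired category

# A malformed line with too few tab-separated fields raises an IndexError,
# which the except clause above silently skips.
bad_line = '19790104\tUSAGOV'
try:
    print bad_line.split('\t')[4] in quad_codes
except IndexError:
    print 'Skipping a malformed line...'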

header = open(filepaths[0], 'r').readline().strip().split('\t')
subset = pd.DataFrame(output, columns = header)

subset['year'] = subset['Day'].map(lambda x: int(str(x)[0:4]))
subset['month'] = subset['Day'].map(lambda x: int(str(x)[4:6]))

keep_columns = ['year', 'month', 'Actor1Code', 'Actor2Code', 'QuadCategory']
subset = subset[keep_columns]

Once the code from the previous sections has finished, the subset of the data is contained in output, which is a list-of-lists with each internal list representing an individual event. Using the pandas library, this list-of-lists can be converted to a pandas DataFrame object, with the header drawn from the first line of the first file in filepaths (note the .strip(), which removes the trailing newline from the last column name). In order to aggregate the data to a specific time period, it is useful to break the Day variable out into months and years using the .map functionality of a pandas Series. The .map call is combined with a lambda function that takes each observation in the series and slices it to obtain the year and month. Finally, the last two lines of the above code reduce the data down to only the columns relevant for this subset.
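The slicing trick is easier to see on a toy example; the dates below are made up purely for illustration.

# Toy illustration of the .map / lambda slicing used above; these dates
# are invented and not taken from GDELT.
days = pd.Series(['19790104', '19800215'])

print days.map(lambda x: int(str(x)[0:4]))  # years: 1979, 1980
print days.map(lambda x: int(str(x)[4:6]))  # months: 1, 2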

subset['verbal_coop'] = 0
subset['verbal_conf'] = 0

subset['verbal_coop'][subset['QuadCategory'] == '2'] = 1
subset['verbal_conf'][subset['QuadCategory'] == '3'] = 1

subset_grouped = subset.groupby(['year', 'month', 'Actor1Code', 'Actor2Code'],
                                as_index = False)
subset_aggregated = subset_grouped.sum()

subset_aggregated.to_csv('gdelt_subset.csv', index = False)

Now that the data is in a properly formatted DataFrame, the next step is to create variables from which to draw counts of the various event types. Two variables, verbal_coop and verbal_conf, are created and assigned values of zero. These variables are then assigned values of one if the QuadCategory matches the value for that event type. This style of indexing in pandas is similar to that of R, and I plan on doing a more in-depth tutorial on pandas at a later date. With the event variables created, the data can be grouped and aggregated. pandas provides the .groupby functionality for DataFrames, which allows you to group the data by specific variables. For the purposes of this dataset, a multi-level grouping is desired, with groups defined by the year, the month, and the dyad. Once this grouping is created, the values are summed within each group, leading to the final, aggregated dataset. This final dataset can be saved using the .to_csv method.
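If the groupby-then-sum step is unfamiliar, the following toy frame, with made-up values rather than GDELT data, shows the same pattern on a small scale.

# Toy illustration of the groupby / sum aggregation; the values are invented.
toy = pd.DataFrame({'year': [1979, 1979, 1979],
                    'month': [1, 1, 2],
                    'Actor1Code': ['USA', 'USA', 'USA'],
                    'Actor2Code': ['RUS', 'RUS', 'CHN'],
                    'verbal_coop': [1, 0, 1],
                    'verbal_conf': [0, 1, 0]})

toy_grouped = toy.groupby(['year', 'month', 'Actor1Code', 'Actor2Code'],
                          as_index = False)

# One row per year-month-dyad, with verbal_coop and verbal_conf summed
# within each group.
print toy_grouped.sum()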

And there you have it: you now have a subset of the GDELT dataset. This is a relatively simple task thanks to Python’s built-in tools. It is possible to make this task run more quickly using parallel processing, as outlined in my previous post on parallel subsetting. As a brief recap, it is simply a matter of using the joblib module, wrapping the above subsetting code in a function, and adding a line along the lines of the following to the script:

data = Parallel(n_jobs=-1)(delayed(subset)(x) for x in list_of_paths)
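In slightly more detail, the pattern looks roughly like this. The function name subset_file, used here instead of subset so it does not clash with the DataFrame variable above, and its body are placeholders standing in for the per-file loop shown earlier.

# Rough sketch of the joblib pattern; subset_file is a placeholder whose body
# would hold the line-by-line subsetting logic for a single file.
from joblib import Parallel, delayed

def subset_file(filepath):
    results = list()
    # ... run the per-line subsetting logic on this one file ...
    return results

# Each worker handles one file; the per-file lists are then combined.
per_file = Parallel(n_jobs=-1)(delayed(subset_file)(x) for x in filepaths)
output = [record for file_results in per_file for record in file_results]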

I hope this brief intro will prove helpful to those who wish to use GDELT in their own research. This tutorial only scratches the surface of working with event data, and there is much more to consider beyond just what subset you select. A good resource on working with event data was written by Jay Yonamine and will likely prove useful to those engaging in event-data research.