John Beieler

PhD Student in Political Science

Actor Codes in GDELT

As I’m sure anyone who has begun exploring the GDELT data has noticed, there are a large number of unique actors coded in the data. While working on a project for a colleague, I began to wonder exactly how many unique actor codes exist in the dataset, and what the maximum level of actor code complexity is. For those who don’t know, the actor codes are created by chaining together three-character CAMEO actor codes, which produces codes such as USAMIL, indicating the United States military. The complexity of an actor code, then, refers to how many of these three-character codes are chained together.

Results

Using the Python code available here, I iterated over the dataset and identified each unique actor code, along with how many times it appears. The resulting data is available here. The results indicate that there are 22,857 unique actor codes in the dataset, with a maximum of six three-character codes chained together. The following two plots show the top twenty actor codes overall, along with the top twenty USA actor codes.

This analysis shows the incredible complexity that is present in the actor codes, which I would argue is the most important, fascinating, and challenging part of working with event data on this scale.
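
The script linked above isn’t reproduced here, but the counting step is simple enough to sketch. The snippet below is a minimal, hypothetical version: the file pattern, the tab-delimited layout, and the positions of Actor1Code and Actor2Code (columns 1 and 2 of the reduced files, as described in the subsetting post below) are assumptions, not the original code.

import glob
from collections import Counter

counts = Counter()
for path in glob.glob('*.reduced.txt'):           # hypothetical file pattern
    with open(path) as infile:
        next(infile)                              # skip the column-name header
        for line in infile:
            fields = line.rstrip('\n').split('\t')
            for code in (fields[1], fields[2]):   # Actor1Code, Actor2Code
                if code:
                    counts[code] += 1

# Each actor code is a chain of three-character CAMEO codes, so the
# "complexity" of a code is its length divided by three.
max_complexity = max(len(code) // 3 for code in counts)
print(len(counts), 'unique actor codes; maximum complexity:', max_complexity)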

GDELT, Big Data, and Theory

I made the remark on Twitter that it seemed like GDELT week, given a Foreign Policy piece about the dataset, Phil and Kalev’s paper for the ISA 2013 meeting, and a host of blog posts about the data. So, in the spirit of GDELT week, I thought I would throw my hat into the ring. Rather than lauding the new age of political and social research that the monstrous scale of the data promises, though, I thought I would write a little about the issues that come with working with such massive data.

Dealing with GDELT

As someone who has spent the better part of the past eight months dealing with the GDELT dataset, including writing a little about working with the data, I feel that I have a somewhat unique perspective. The long and the short of my experience is: working with data on this scale is hard. This may strike some as obvious, especially given the cottage industry that has sprung up around Hadoop and other services for processing data. GDELT is 200+ million events spread across several years. Each year of the reduced data is in a separate file and contains information about many, many different actors. This is part of what makes the data so intriguing and useful, but it is also unlike data such as the ever-popular MID data in political science that is easily managed in a program like Stata or R. The data requires subsetting, massaging, and aggregating; having so much data can, at some points, become overwhelming. What states do I want to look at? What type of actors? What type of actions? What about substate actors? Oh, what about the dyadic interactions? These questions and more quickly come to the fore when dealing with data on this scale. So while the GDELT data offers an avenue to answer some existing questions, it also brings with it many potential problems.

Careful Research

So, that all sounds kind of depressing. We have this new, cool dataset that could be tremendously useful, but it also presents many hurdles. What, then, should we as social science researchers do about it? My answer is careful theorizing and thinking about the processes under examination. This might be a “well, duh” moment to those in the social sciences, but I think it is worth saying when some are heralding “The End of Theory”. This type of large-scale data does not reduce theory and the scientific method to irrelevance. Instead, theory is elevated to a position of higher importance. What states do I want to look at? What type of actions? Well, what does the theory say? As Hilary Mason noted in a tweet:

Data tells you whether to use A or B. Science tells you what A and B should be in the first place.

Put into more social-scientific language, data tells us the relationship between A and B, while science tells us what A and B should be and what type of observations should be used. The data under examination in a given study should be driven by careful consideration of the processes of interest. This idea should not, however, be construed as a rejection of “big data” in the social sciences. I personally believe the exact opposite; give me as many features, measures, and observations as possible and let algorithms sort out what is important. Rather, I think the social sciences, and science in general, are about asking interesting questions of the data, which will often require more finesse than taking an “ANALYZE ALL THE DATA” approach. Thus, while datasets like GDELT provide new opportunities, they are not opportunities to relax and let the data do the talking. If anything, big-data generating processes will require more work on the part of the researcher than previous data sources.

How Do I GDELT?: Subsetting and Aggregating the GDELT Dataset

GDELT

Over the past week, the Global Data on Events, Location and Tone (GDELT) dataset was finally released to the general public. The data is available at the Penn State event data website. We at Penn State had the good fortune to have access to this dataset for many months before its public release, which allowed us to gain some experience working with this massive collection of data. As a brief background, GDELT is composed of event data records spanning 1979 to mid-2012. The events are coded according to the CAMEO coding scheme, with the addition of a “QuadCategory,” which separates the events into the material conflict, material cooperation, verbal conflict, and verbal cooperation categories. The data is spread across 33 different files, each of which is substantially large on its own. This makes the data fairly difficult to work with and almost guarantees that some subsetting is necessary in order to perform analysis. Phil Schrodt has included some programs with the data to aid in this subsetting, but I thought there might be some who would prefer to get their hands dirty and write some of their own code. Given this, I thought I would share some of the knowledge I gained while working with the GDELT dataset.

For the purposes of this brief introduction, I will work under the assumption that the desired events are those that originate from the United States, are directed at some type of state actor, and are either verbal cooperation or conflict. The following code, written in Python, demonstrates how this subset might be selected from the GDELT data. The code also assumes the reader has the pandas and path Python modules installed. Both can be installed using the normal pip install method. Finally, the complete code is available as a gist on github.

Before starting, it is always useful to take a peek at the data. This is as simple as opening up the terminal and using head 1979.reduced.txt. Doing this shows the various columns in the reduced data and how they are arranged. We can see that the date is in the 0 index, with Actor1Code and Actor2Code in spots 1 and 2, respectively. Additionally, EventCode is located in spot 3, while the QuadCategory variable is, fittingly, in position 4. These indices will prove crucial when it comes time to split and subset the data.
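
The gist linked above isn’t reproduced here, but a stripped-down sketch of that subset, using the column positions just described, might look like the following. The QuadCategory values for verbal cooperation and verbal conflict, the “state actor” check, and the output column names are assumptions for illustration and should be checked against the actual reduced files.

import pandas as pd

# Hypothetical QuadCategory codes for verbal cooperation and verbal conflict;
# confirm these against the actual data before relying on them.
KEEP_QUADS = {'1', '3'}

def subset_file(filepath):
    rows = []
    with open(filepath) as infile:
        next(infile)                                   # skip the header line
        for line in infile:
            fields = line.rstrip('\n').split('\t')
            actor1, actor2, quad = fields[1], fields[2], fields[4]
            # Keep events originating from the United States, directed at some
            # state actor (approximated here as any non-empty Actor2Code), and
            # falling into one of the verbal quad categories.
            if actor1.startswith('USA') and actor2 and quad in KEEP_QUADS:
                rows.append(fields[:5])
    return pd.DataFrame(rows, columns=['Date', 'Actor1Code', 'Actor2Code',
                                       'EventCode', 'QuadCategory'])

us_verbal = subset_file('1979.reduced.txt')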

Web Scraping Tutorial

I had the opportunity to give a short tutorial on web scraping for the Event Data class here at Penn State. I used an IPython notebook to give the presentation and I’ve put the code in a gist. The link to the IPython notebook is http://nbviewer.ipython.org/4743272/.

The PITF project I make reference to is hosted on github.

Parallel Data Subsetting

The Challenge

I’ve been working with some data that is spread over multiple tab-delimited files and is rather large (on the order of 20-30 GB). The task has been to comb through the data and extract observations (rows) if they match certain characteristics. Specifically, each file is iterated over, and if a field within a line matches a given value then the line should be extracted and appended to the final output. This task is relatively straightforward in Python, but iterating over all of the files takes around 45 minutes. While this isn’t an exorbitant amount of time, faster is always better. This problem is also embarrassingly parallel: the different files do not need to communicate their results to each other; the results simply need to be stacked into a final array in order to be saved to a file. Thus began my saga to implement a parallel version of the script I wrote to iterate over the files and select the lines.

The Code

As with my other posts, I’ll walk through the code line by line; a rough sketch of the overall approach appears after the list.

  • 1-2 Just imports. I’ll be using the Parallel and delayed functions from the joblib module.

  • 4-13 Defining the function to subset out the data. The code is fairly easy to follow here: the file is opened, and the first line is read off since it contains the column names and should not be appended to the result. Each remaining line is then iterated over, split on the basis of tabs, and appended if it meets a certain criterion.

  • 15-19 Function to stack the data. The joblib Parallel function returns a list of lists, with each inner list holding the results from an individual file. The stack function iterates over that list, converts each inner list to a numpy array, and stacks the current data with the previous data.

  • 22-28 Running the script. The main focus here is on line 27. The first argument that Parallel takes is the number of jobs to be used. Setting n_jobs to -1 says to use all possible cores. The second argument is the function to be run in parallel. The joblib docs indicate that “The delayed function is a simple trick to be able to create a tuple (function, args, kwargs) with a function-call syntax.” So delayed is passed the subset function with the arg x, which represents the file to be opened as held in filepath. This data is then stored as a list of lists, and stacked using the stack function.
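
The original script isn’t reproduced here, but a rough sketch of the approach described above might look like the snippet below. The file pattern, the field index, and the match value are placeholders rather than the actual subsetting criteria.

import glob
import numpy as np
from joblib import Parallel, delayed

def subset(filepath, field_index=1, match='USA'):
    """Pull out the rows of one tab-delimited file whose field matches a value."""
    keep = []
    with open(filepath) as infile:
        next(infile)                              # skip the column-name header
        for line in infile:
            fields = line.rstrip('\n').split('\t')
            if fields[field_index] == match:
                keep.append(fields)
    return keep

def stack(results):
    """Stack the per-file lists of rows into a single numpy array."""
    return np.vstack([np.array(rows) for rows in results if rows])

if __name__ == '__main__':
    filepaths = glob.glob('*.reduced.txt')        # placeholder file pattern
    # n_jobs=-1 uses every available core; delayed(subset)(path) builds the
    # (function, args, kwargs) tuples that Parallel farms out to the workers.
    results = Parallel(n_jobs=-1)(delayed(subset)(path) for path in filepaths)
    data = stack(results)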

Some quick preliminary examination shows that this parallel implementation is much, much faster than running in serial. Running on two files is almost instantaneous, which is a drastic improvement.

R Magic and Bootstrapped T-test

Following up on my last post, I wanted a way to test my bootstrapped t-test function against the regular t-test function in R. While I was able to do this by copy-pasting between R and a Python shell, this was less than ideal. I then saw, however, a post by Christopher Fonnesbeck that discussed the use of the rmagic extension in IPython, which can be loaded using the %load_ext magic function. So, with this in mind, I decided to test it out with a comparison between my bootstrap function and the t.test function in R. As a note, the rmagic extension requires rpy2, so just pip install rpy2 and you should be good to go.

import bootFunction

%load_ext rmagic
%R treatment = c(24,25,28,28,28,29,29,31,31,35,35,35)
%R control = c(21,22,24,27,27,28,29,32,32)
%Rpull treatment control

bootFunction.bootstrap_t_test(treatment, control, direction = "greater")

%R print(t.test(treatment, control, alternative = "greater"))

I first import the set of functions from the bootFunction module. I then load the rmagic extension using the %load_ext magic function. Using the %R magic function, I define two vectors of data, treatment and control, in the R space. I then use %Rpull to pull the two vectors from the R space into the Python shell, where the two variables become structured numpy arrays. I then perform the bootstrapped t-test as described in the earlier post. Finally, using the %R magic function again, I print out the results of the t.test function in R using the same data. The p-values aren’t exactly the same, as is to be expected, but are at least within the same ballpark (the R t-test gives .05, while the bootstrap function has returned a range between .05 and .03).

Bootstrapped T-test

The code described below performs a nonparametric two-sample t-test using bootstrapping. What follows is a line-by-line description of what’s going on; a rough sketch of such a function appears after the list.

  • 1-4: Just some imports. We need floating-point division from the __future__ module, numpy, pandas, and the random module.

  • 6-8: Defining a function sample that samples with replacement from the dataset, creating a new dataset of the same length as the original data. This makes use of the random.choice function, which samples one item randomly from the data. This function is called the same number of times as there are observations in the data.

  • 10-17: Defining a function to perform the t-test, with two data inputs, the number of repetitions to be performed, and the direction of the alternative hypothesis. First a 2 x n matrix is defined, with row 1 being all ones and row 2 being the data. This is then transposed to create an n x 2 matrix. The same procedure is then repeated for the control data, except with 0s instead of 1s. These two matrices are then stacked on top of each other. tstat is the difference between the two groups, and tboot is a vector of zeros with length equal to the number of repetitions for the bootstrap.

  • 18-21: This for-loop actually performs the bootstrap for the number of times indicated by nboot. First, a sample of the data (Z) is taken using the sample function defined above. This is then transformed into a pandas DataFrame, and given appropriate column names. Finally, the difference in means of the two groups is taken for each iteration of the loop and stored in the appropriate location in tboot.

  • 22-28: This is simply calculating a proportion of samples that were greater or less than the test statistic, based on the direction of the alternative hypothesis. The final line (29) then prints the p-value as a float.
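
Piecing the steps above together, a rough sketch of such a bootstrap function might look like the snippet below. This is not the original gist (its line numbers will not match the walkthrough), and it resamples the pooled values under the null hypothesis rather than resampling the labelled rows, so treat it as one reasonable way to set up the test rather than the exact implementation.

import random
import numpy as np
import pandas as pd

def sample(values):
    """Resample `values` with replacement, keeping the original length."""
    return [random.choice(values) for _ in range(len(values))]

def bootstrap_t_test(treatment, control, nboot=1000, direction='greater'):
    treatment = np.asarray(treatment, dtype=float)
    control = np.asarray(control, dtype=float)

    # Observed difference in group means (the test statistic).
    tstat = treatment.mean() - control.mean()

    # Pool the observations and keep a fixed column of group labels
    # (1 = treatment, 0 = control), mirroring the structure described above.
    pooled = np.concatenate((treatment, control))
    labels = np.concatenate((np.ones(len(treatment)), np.zeros(len(control))))

    tboot = np.zeros(nboot)
    for i in range(nboot):
        # Resample the values under the null, where group membership is irrelevant.
        Z = pd.DataFrame({'group': labels, 'value': sample(list(pooled))})
        means = Z.groupby('group')['value'].mean()
        tboot[i] = means[1.0] - means[0.0]

    # Proportion of bootstrap differences at least as extreme as the observed one.
    if direction == 'greater':
        pvalue = np.mean(tboot >= tstat)
    elif direction == 'less':
        pvalue = np.mean(tboot <= tstat)
    else:
        pvalue = np.mean(np.abs(tboot) >= abs(tstat))
    print(float(pvalue))
    return pvalue

Called as bootstrap_t_test(treatment, control, direction="greater"), the sketch prints and returns a one-sided bootstrap p-value, matching the call shown in the R magic post above.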

Starting With Python

Overview

What is this?

In short, setting up Python (and other things) for scientific computing and research can be entirely more complicated than necessary. With that said, this aims to be a short how-to guide pointing to some resources that can make life much easier. This post is geared towards political scientists coming from either 1) using R as a programming language or 2) having no programming and minimal computing experience. I have tried most of the things listed here myself, but I make no guarantees that everything will work properly or won’t break something when attempted. As with anything, proceed with caution and at your own risk.

This how-to is mainly geared towards OS X, but many of the suggestions should also work on Linux (and are probably easier). I don’t have any experience setting up Windows and would probably suggest looking into dual booting Linux (see here for more). Downloading Ubuntu to a CD and setting up a dual boot is extremely easy.

I’ll be adding to this as I have time and think of different things that have helped me. I know this post is long, but there is a large amount of information to share, and I think it is easier to get a lot of it in one place, rather than spread out.