John Beieler

PhD Student in Political Science

Making GDELT Downloads Easy

Downloading GDELT

Yesterday, I started downloading the GDELT data from the website, having previously pulled the data from the servers at Penn State. Having to navigate the website and download each file individually caused me far more frustration than it should have, not due to the design of the website but due to my own impatience. I’ll have more to say about the general state of data distribution in the social sciences in another post, but for now it’s enough to say that I’m not a fan of downloading data by hand. Because of this, I wrote some scripts in Python to help with downloading the data. There are two scripts, download_historical.py and download_daily.py, which download either the historical files or the continuously updated daily files. The code is on GitHub. I’ve copied the contents of the README file below so you can determine if the scripts would be useful for you.

README

GDELT Download

The GDELT data is spread across multiple files, with a new file added each day. Downloading each and every file by hand is not a fun endeavor, so these scripts were written to aid in downloading the GDELT data. The first script, download_historical.py, is aimed at downloading the historical data and the previously uploaded daily updates. The second script, download_daily.py, is aimed at downloading the new files that are uploaded to the GDELT website each day. This script lets the user either call the script each day to fetch the newest upload, or run the process in the background to download the new updates each day at 10:00am.

Each script implements a 30-second delay between requests where appropriate in order to avoid swamping the server.
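The delay can be pictured with a minimal sketch like the one below. This is not the code from the scripts themselves: the function name download_all and its fetch and sleep parameters are hypothetical, chosen so the pause between requests can be seen (and tested) without touching the network.

```python
import time


def download_all(urls, fetch, delay=30, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between requests.

    `fetch` and `sleep` are injected so the pacing logic can be exercised
    without real network access; both names are illustrative assumptions.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            sleep(delay)
        results.append(fetch(url))
    return results
```

In real use, fetch would be whatever function actually retrieves and saves one file, and the default time.sleep would provide the pause.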

download_historical Usage

The script has three modes: daily, single, and range.

Note: If you wish to use the daily mode, the requests and lxml libraries are necessary. You can install each one using pip install library_name. The script also makes use of argparse, which is included in the standard library from Python 2.7 onward. If you are using an older version, it is necessary to install argparse using pip or easy_install.
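For readers unfamiliar with argparse, a command-line interface with these three modes could be laid out roughly as follows. This is a sketch under assumptions, not the script's actual source: only the mode names and the -d, -y, and -U flags come from this README, while the long option names (--directory, --year, --unzip) and the build_parser helper are made up for illustration.

```python
import argparse


def build_parser():
    """Build a parser with daily, single, and range sub-commands."""
    parser = argparse.ArgumentParser(prog="download_historical.py")
    subparsers = parser.add_subparsers(dest="mode")

    # Options shared by every mode (long names are assumptions).
    common = argparse.ArgumentParser(add_help=False)
    common.add_argument("-d", "--directory", required=True,
                        help="directory to which files are written")
    common.add_argument("-U", "--unzip", action="store_true",
                        help="unzip each downloaded file")

    # The daily mode needs no year argument.
    subparsers.add_parser("daily", parents=[common])

    # The single and range modes both take -y.
    for mode in ("single", "range"):
        sub = subparsers.add_parser(mode, parents=[common])
        sub.add_argument("-y", "--year", required=True,
                         help="year (single) or start-end years (range)")
    return parser
```

With this layout, build_parser().parse_args(["single", "-y", "1979", "-d", "~/gdelt/", "-U"]) yields a namespace whose mode is "single" and whose unzip attribute is True.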

Daily:

The daily mode downloads the daily updates that are currently uploaded to the GDELT website.

Usage:

python download_historical.py daily -d ~/gdelt/ -U

Where -d is the flag for the directory to which the files should be written, and -U is the optional flag indicating whether each downloaded file should be unzipped.
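As a rough illustration of what the -U option implies, the sketch below extracts a downloaded archive into the target directory using the standard library. The function name unzip_file is hypothetical, and whether the real scripts keep or delete the archive afterwards is not stated here, so this version simply leaves it in place.

```python
import os
import zipfile


def unzip_file(zip_path, out_dir):
    """Extract every member of zip_path into out_dir and return their names."""
    # Expand "~" so a directory such as ~/gdelt/ works as expected.
    out_dir = os.path.expanduser(out_dir)
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        archive.extractall(out_dir)
    return names
```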

Single:

The single mode downloads the updates for a single year.

Usage:

python download_historical.py single -y 1979 -d ~/gdelt/ -U

Where -y is the flag that indicates which year should be downloaded, -d is the flag for the directory to which the files should be written, and -U is the optional flag indicating whether each downloaded file should be unzipped.

Range:

The range mode downloads the updates for a range of years.

Usage:

python download_historical.py range -y 1979-2012 -d ~/gdelt/ -U

Where -y is the flag that indicates which years should be downloaded, -d is the flag for the directory to which the files should be written, and -U is the optional flag indicating whether each downloaded file should be unzipped.
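The value passed to -y is either a single year or a start-end pair. One way such a value might be expanded into individual years is sketched below; expand_years is a hypothetical helper, not a function taken from the script, and treating the end year as inclusive is an assumption.

```python
def expand_years(spec):
    """Turn '1979' into [1979] and '1979-2012' into [1979, ..., 2012]."""
    start, sep, end = spec.partition("-")
    first = int(start)
    # Without a hyphen, the spec names a single year.
    last = int(end) if sep else first
    if last < first:
        raise ValueError("end year precedes start year: %r" % spec)
    # The end year is treated as inclusive (an assumption).
    return list(range(first, last + 1))
```

Each year in the resulting list can then be handled exactly as the single mode handles its one year.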

download_daily Usage

The script has two modes: fetch and schedule.

Note: If you wish to use the schedule mode, the schedule library is necessary. You can install using pip install schedule.

Fetch:

The fetch mode downloads only the current date’s update.

Usage:

python download_daily.py fetch -d ~/gdelt/ -U

Where -d is the flag for the directory to which the files should be written, and -U is the optional flag indicating whether each downloaded file should be unzipped.

Schedule:

The schedule mode sets the script to run in the background and, each day at 10:00am, request that date’s upload from the server. In order to work, the script must be left running in a terminal tab. The use of a utility such as screen or tmux is recommended in order to allow the program to run unmonitored in the background.

Usage:

python download_daily.py schedule -d ~/gdelt/ -U

Where -d is the flag for the directory to which the files should be written, and -U is the optional flag indicating whether each downloaded file should be unzipped.
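The script relies on the schedule library for its daily 10:00am run. To make the underlying idea concrete without that dependency, here is a standard-library sketch that computes how long a process would need to wait until the next 10:00am. The seconds_until function and its now parameter are illustrative assumptions, not part of the script.

```python
import datetime


def seconds_until(hour, minute=0, now=None):
    """Return the number of seconds from `now` until the next hour:minute."""
    now = now or datetime.datetime.now()
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        # Today's slot has already passed, so aim for tomorrow.
        target += datetime.timedelta(days=1)
    return (target - now).total_seconds()
```

A long-running loop could sleep for seconds_until(10) and then perform the same work as the fetch mode, which is roughly what a scheduler does on the user's behalf.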