Downloading GDELT
Yesterday, I started downloading the GDELT data from the website,
having previously pulled the data from the servers at Penn State.
Having to navigate the website and download each file individually
caused me far more frustration than it should have, not due to the
design of the website but due to my own impatience. I’ll have more
to say about the general state of data distribution in the social
sciences in another post, but for now it’s enough to say that I’m
not a fan of downloading data by hand. Because of this, I wrote some
scripts in Python to help with downloading the data. There are two
scripts, download_historical.py and download_daily.py, which
download either the historical files or the continuously updated
daily files. The code is on GitHub.
I’ve copied the contents of the README file below so you can determine
if the scripts would be useful for you.
README
GDELT Download
The GDELT data is spread across multiple files, with a new file added each day.
Downloading each and every file is not a fun endeavor. These scripts were
written in order to aid in the download of the GDELT data. The first script,
download_historical.py, is aimed at downloading the historical data and the
previously posted daily updates. The second script, download_daily.py, is aimed
at downloading the new files that are uploaded to the GDELT website each day.
This script enables the user to either call the script each day to fetch the
newest upload, or to run the process in the background to download the new
updates each day at 10:00am.
Each script implements a 30-second delay where appropriate in order to avoid swamping the server.
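The idea behind the delay can be sketched as follows. This is a minimal illustration rather than the code in the scripts: the names download_all, fetch, and sleep are hypothetical, and the fetch callable stands in for whatever actually retrieves a file.

```python
import time


def download_all(urls, fetch, delay=30, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between requests.

    `fetch` and `sleep` are passed in so the loop can be exercised
    without touching the network or actually waiting.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            sleep(delay)
        results.append(fetch(url))
    return results
```

Skipping the pause before the first request means a single-file download is not slowed down at all.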
download_historical Usage
The script has three modes: daily, single, and range.
Note: If you wish to use the daily mode, the requests and lxml libraries
are necessary. You can install both using pip install library_name. The script
also makes use of argparse, which is included in the standard library as of
Python 2.7. If you are using an older version, it is necessary to install
argparse using pip or easy_install.
Daily:
The daily mode downloads the daily updates that are currently uploaded to the GDELT website.
Usage:
python download_historical.py daily -d ~/gdelt/ -U
Where -d is the flag for the directory to which the files should be written,
and -U is the optional flag indicating whether each downloaded file should
be unzipped.
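For readers curious what the -U step amounts to, here is a rough sketch using the standard zipfile module. The function name unzip_file and the remove_archive option are assumptions for illustration; whether the scripts keep or delete the archive after extraction is not stated here.

```python
import os
import zipfile


def unzip_file(zip_path, directory, remove_archive=False):
    """Extract every member of zip_path into directory.

    Returns the list of member names that were extracted. The archive is
    kept unless remove_archive is True (a hypothetical option).
    """
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        archive.extractall(directory)
    if remove_archive:
        os.remove(zip_path)
    return names
```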
Single:
The single mode downloads the updates for a single year.
Usage:
python download_historical.py single -y 1979 -d ~/gdelt/ -U
Where -y is the flag that indicates which year should be downloaded, -d
is the flag for the directory to which the files should be written, and -U
is the optional flag indicating whether each downloaded file should be unzipped.
Range:
The range mode downloads the updates for a range of years.
Usage:
python download_historical.py range -y 1979-2012 -d ~/gdelt/ -U
Where -y is the flag that indicates which years should be downloaded, -d
is the flag for the directory to which the files should be written, and -U
is the optional flag indicating whether each downloaded file should be unzipped.
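The three modes and their flags map naturally onto argparse subcommands. The sketch below shows one way such a command line could be wired up; it is not taken from download_historical.py, and the helper names parse_years and build_parser, as well as the attribute names directory, unzip, and years, are assumptions.

```python
import argparse


def parse_years(text):
    """Turn '1979' or '1979-2012' into an inclusive list of years."""
    start, _, end = text.partition("-")
    first = int(start)
    last = int(end) if end else first
    if last < first:
        raise argparse.ArgumentTypeError("range end precedes start: %s" % text)
    return list(range(first, last + 1))


def build_parser():
    """Build a parser with one subcommand per mode."""
    parser = argparse.ArgumentParser(prog="download_historical.py")
    modes = parser.add_subparsers(dest="mode")
    for mode in ("daily", "single", "range"):
        sub = modes.add_parser(mode)
        sub.add_argument("-d", dest="directory", required=True,
                         help="directory to which files are written")
        sub.add_argument("-U", dest="unzip", action="store_true",
                         help="unzip each downloaded file")
        if mode != "daily":
            # Only the single and range modes take a year argument.
            sub.add_argument("-y", dest="years", type=parse_years,
                             required=True, help="year or start-end range")
    return parser
```

Parsing both "1979" and "1979-2012" into a list lets the single and range modes share the same download loop.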
download_daily Usage
The script has two modes: fetch and schedule.
Note: If you wish to use the schedule mode, the schedule library
is necessary. You can install using pip install schedule.
Fetch:
The fetch mode downloads only the current date’s update.
Usage:
python download_daily.py fetch -d ~/gdelt/ -U
Where -d is the flag for the directory to which the files should be written,
and -U is the optional flag indicating whether each downloaded file should
be unzipped.
Schedule:
The schedule mode sets the script to run in the background and request
that date’s upload from the server each day at 10:00am. In order to work, the
script must be left running in a terminal tab. The use of a utility such as
screen or tmux is recommended in order to allow the program to run
unmonitored in the background.
Usage:
python download_daily.py schedule -d ~/gdelt/ -U
Where -d is the flag for the directory to which the files should be written,
and -U is the optional flag indicating whether each downloaded file should
be unzipped.
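The schedule mode boils down to registering a daily job and then polling for it in a loop, which is why the script has to stay running. A minimal sketch of that pattern with the schedule library follows. The function run_daily and its scheduler, max_polls, and sleep parameters are hypothetical conveniences for illustration and testing, not part of download_daily.py.

```python
import time


def run_daily(job, at="10:00", poll_seconds=60,
              scheduler=None, max_polls=None, sleep=time.sleep):
    """Register job() to run every day at `at` (local time), then poll.

    With max_polls=None the loop never returns, which matches a script
    left running in a terminal.
    """
    if scheduler is None:
        # Third-party dependency: pip install schedule
        import schedule as scheduler
    scheduler.every().day.at(at).do(job)
    polls = 0
    while max_polls is None or polls < max_polls:
        # Runs the job only if its scheduled time has passed.
        scheduler.run_pending()
        sleep(poll_seconds)
        polls += 1
```

Calling run_daily(fetch_todays_file), where fetch_todays_file is a placeholder for the download routine, would block and trigger the download once a day at 10:00am.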