Easier subsetting of data
Yes, another post about GDELT. But this one can apply to other datasets, too.
In an earlier post,
I wrote about how to start subsetting the GDELT data
using Python. Others also wrote similar
pieces. Each of these posts used the
same basic idea: iterate over each line of the dataset, split the line based on tabs,
and select the lines that have fields that match some criteria. This was all well and
good, especially when working with the reduced dataset. The release of the full GDELT
data, however, complicates matters somewhat. Whereas the reduced dataset only has 11
fields of data, the full dataset contains 56 or 57 fields, depending on which set
of the full data is under examination. On top of this, I have noticed that when writing
more complex subsetting scripts it is often easy to lose track of the rules for
selection. These rules are also obfuscated in the Python code for splitting and
selecting. What was field 35 again? Suffice to say that I have become tired of writing
subsetting scripts. A second development is my growing using of SQL resources, including
those such as SQLite and Hive for Hadoop. I have found that these resources make parsing data
much easier, and I will have more to say about these technologies, specifically Hive and
Hadoop, in a later post as some projects I am working on develop further. But, currently,
it is possible to make use of SQL queries while still remaining in the Python ecosystem
and making use of fantastic libraries such as pandas. All while avoiding the actual
setup of a SQL database.
pandasql
pandas is, in my opinion, one of the best, and most important,
libraries for data analysis in Python. If you haven’t taken a look at it yet, you are really
doing yourself a disfavor. At its core, pandas is a “library providing high-performance,
easy-to-use data structures and data analysis tools for the Python programming language.”
One of the key features is R-style dataframes in Python. In addition to pandas, the
individuals at y-hat
built a SQL interface for pandas dataframes in a library called pandasql. Using pandasql
is rather easy, and for those who have never used SQL before, the syntax is easy to learn
and comprehend.
Installation of pandasql follows the usual method with Python packages:
pip install pandasql
The example I’ll provide in this post makes use of the GDELT daily update for June 4th. The first step is to take care of the library imports and data loading.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 | |
I found that is necessary to indicate the text encoding when importing the data, else an error is
thrown when trying to use pandasql. The resulting dataframe contains 41,569 events. For this
example, I’m going to take a look at the ongoing crisis in Syria and gather a subset of the
data that reflects this. To do this, I choose only events that have both country codes
as ‘SYR’.
1 2 3 4 5 6 7 8 9 10 11 12 | |
This query results in a dataset with 91 events. The query also took roughly 5 seconds to
complete on my laptop. Not too bad. While this subset can now be used to perform some analyses,
it’s also possible to perform a simple analysis using SQL queries. For instance, I might wish
to know who the different source actors are, along with the sum of the GoldsteinScale
values for each different source actor. SQL queries make this extremely easy and elegant.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | |
This gives the result of:
1 2 3 4 5 6 7 8 | |
Additionally, we might want to see the same information for dyadic interactions between actors along with information regarding how many events occur within the dyad, which only requires a small modification of the previous query.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 | |
It is also possible to perform more complicated groupings and analyses using only SQL but, as
these examples show, even with a very limited knowledge of SQL one is able to draw subsets of
the data and perform some simple, but interesting, analyses. Additionally, since the data is stored
in a pandas dataframe, any analytical methods that are contained in pandas can be used, which
provides another set of powerful options for analysis. The final benefit of pandasql is that writing
these queries is much faster than writing a subsetting script, and the queries are also
remarkably fast; none of the queries used in this post took more than 10 seconds to run.
Notes
While pandasql, and SQL in general, is very useful and far more elegant than hacked together
scripts, there are some limitations. The primary limitation is that this method likely will not
scale well to iterating over the entire GDELT dataset, i.e., all the data in the historical backfile
downloads. The default behavior of sqldf is to hold the data in memory. There is an option to
temporarily write the data to disk, but this still does not address the issue of reading in the
yearly or monthly data into the initial dataframe. There is likely a workaround using the
chunk iterator for pd.read_csv, but in my mind this defeats part of the appeal for using
pandasql in the first place: quickness and ease of use. For working with the entire dataset,
a technology such as Hive, or a full SQL database, is likely a more viable option. I will have
more to say on the use of Hive in a future post. With all that said, however, pandasql is
a very useful tool when analyzing more atomic chunks of the GDELT data; daily, monthly, and perhaps
even a single year for the earlier years is not outside the realm of possibility. Finally, if you
need a resource to learn SQL, I recommend Zed Shaw’s Learn SQL The Hard Way,
which will provide as much SQL as you need to perform data subsetting tasks.