Processing Data With Hive
Working with big data is not a particularly easy task. There is a lot of commentary on the web about what constitutes “big” data. The GDELT dataset, which is the focus of this post, is over 40 gigabytes uncompressed, so this is not a discussion of “Google-size” data. It is, however, more than most social scientists are used to dealing with in one pass. I’ve attempted to chronicle my history of working with the GDELT dataset to draw interesting conclusions about the world using event data. I’ve been relatively successful so far, but I felt that it was possible to make the data easier to work with. Towards this end, I began to explore SQL and database technologies to use as a subsetting method. I finally landed on Hive, which is part of the Hadoop ecosystem. Hive allows you to run SQL queries (Hive’s language is actually called HiveQL) on top of the map/reduce framework for computation. The data is distributed (mapped) across multiple nodes in a server cluster and queries are run atomically on this set of the data. This distributed data is then recombined (reduced) back into a single output form.
Using Hive, it is possible to run fairly complex queries across the entirety of the GDELT dataset in roughly five minutes. This speed is possible thanks to Amazon’s Elastic MapReduce environment, which makes use of Elastic Cloud Compute (EC2) resources as the computational backend. EC2 makes it cheap and easy to rent a large cluster of servers for cheap; as an example, I have used 40 servers at ~$0.10/hr per server. Thus, the combination of Hive and Amazon Web Services makes it remarkably easy to get up and running with quick queries over this very interesting dataset. The rest of this post shows you how.
Preparing The Data
Before doing any actual analyses with the data it must be loaded into the Amazon Web Services (AWS) environment. The primary service for storing data on AWS is the “Simple Storage Solution” (S3). To upload data to S3, first sign in to your AWS account and head to the AWS Management Console. Once at the console, you should select the S3 services, as seen below.
Within the S3 console, you should create a new bucket by clicking the “Create Bucket” button at the top left of the screen. The bucket name should be globally unique, i.e., it does not appear anywhere else in the S3 system. Your name is likely a good choice. Once you have selected a bucket name, select the “Create” option rather than “Set Up Logging >”. The next step is to navigate to the created bucket and create a new folder; I chose “gdelt” as the folder name.
You should navigate to this new folder and begin uploading the GDELT data. For the purposes of this tutorial only a couple
years are necessary. I recommend downloading the 1979 and 1980 data and uploading only these years to the
folder on S3 as a start. As a recap, you should download the desired GDELT data and
upload these files to the S3 bucket and folder that you created. The Hive queries used later in this post will load all of
the data within a folder into the table, so make sure that only the GDELT data is contained in the bucket/folder path.
Loading the entire dataset onto S3 is a lengthy process. I have the entire dataset uploaded, along with continual, daily updates, and I am planning on making this bucket public, but a lot depends on the potential costs. I will have more to say about this as I find out more information.
Setting Up Hive
With the data uploaded to Amazon, the next step is setting up and using Hive. First, return to the AWS Management Console and
select the Elastic MapReduce (EMR) service. Once in the EMR console, select “Create New Job Flow”, seen in the
top left-hand corner of the EMR console. Once the job flow dialog appears, you should give the job a name, I chose
and select the job type, which should be “Hive Program”.
On the next screen, you should simply choose the “Start an Interactive Hive Session” radio button.
The next screen involves the selection of EC2 instance types of use as the computational power for the map/reduce tasks. For the purposes of this post, I’ll make use of 10 instances of the small server type for the core instance group and another small instance for the master instance. These instances cost $0.06/hr to run, with Amazon rounding up to the nearest hour for any jobs. The queries used in this tutorial should take less than an hour, which means that the total cost will come to roughly $0.70.
The only change needed on the next dialog screen is the selection of a key pair that is used to secure shell (SSH) into the master node.
Creating an EC2 Key Pair is beyond the scope of this tutorial, but Amazon has a tutorial
that should provide all of the necessary information; the “Have Amazon EC2 generate it for you” is the
option I suggest. You should make sure to give the pair a distinctive name and download them to a memorable location on your computer.
The most common place to store ssh keys is in a
~/.ssh directory. The next dialog box does not have any options that need to be changed;
no bootstrap actions are required for our purposes. The final screen shows a review of the job flow that will be created. Make sure
everything looks good and then click “Create Job Flow.” Following this, the job flow should appear in the EMR console.
It will likely take a few minutes for everything to warm up.
Elastic MapReduce Command-Line Interface
In order to communicate with the EMR cluster you have created some additional tools are necessary, specifically, the EMR
command line interface. Amazon has a pretty good tutorial
on this as well. The only additional comments I will provide are to set the “region” argument in the credentials file to “us-east-1”, and that the
“log_uri” should be the bucket we created earlier such and should look like
"s3://<YOURNAME>/" if you used your name to create the bucket.
I do know, however, that I had some issues installing the CLI since I have Ruby version 1.9.3 installed on my computer. I do not know if this is still an issue, but if you have issues with the CLI and have a version of Ruby different than 1.8, a different implementation of the CLI exists at https://github.com/tc/elastic-mapreduce-ruby. You should download this using the ZIP download option towards the top of the screen. The installation and usage for this version of the CLI is the same as with the version compatible with 1.8. The other option is to revert to Ruby 1.8.7.
With the CLI downloaded and set up, you should
cd to the directory and issue the
./elastic-mapreduce --list command. This should show
you the history of any job flows you created, along with the job IDs. The job ID is the first field for each job flow and looks something
j-ABC123DEFG45. This ID is used to SSH into the master node for the job, using the
./elastic-mapreduce -j j-ABC123DEFG45 --ssh or the
./elastic-mapreduce -ssh j-ABC123DEFG45 command, with the latter being
the command to use if you are using the CLI for Ruby 1.8 available from the Github link above.
With this, you should finally have a EMR instance up and running. The next section shows how to use this instance to query the GDELT
data using Hive.
Querying The Data
While at the instance prompt, assuming you have already SSH’ed into the instance, you should issue the command
hive. As a note, I
will not cover the finer points of SQL in this post. SQL is easy to learn and numerous tutorials exist that can help you learn. I
suggest Zed Shaw’s Learn SQL The Hard Way. This section will only cover creating the table that stores
the GDELT data along with some basic queries, along with saving the output to a CSV file.
In order to create the table for GDELT, the following command should be issued in the Hive interpreter. Copy and paste should work fine,
excepting the final line; the last line should be modified to match the path to the S3 bucket to which the data was uploaded, e.g.,
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
This creates a table, named
gdelt, with fields for each of the columns available in the GDELT dataset. It is now possible to run some simple queries against
the data. Before doing so, however, it is useful to adjust the number of
reducers used in the map/reduce jobs since these are automatically
determined from the size of the input; upping the number of
reducers can speed the queries. To do this, enter
set hive.exec.reducers.max=200; set mapred.reduce.tasks=200; into the Hive prompt.
The first example query used in this post is the same as in my previous post:
select all events that occurred within Syria, show the dyadic interactions, and create a sum of the Goldstein values for each dyad.
The data is drawn from the
gdelt table created using the previous command.
The code for the query is the exact same as in the previous post:
1 2 3 4 5 6 7 8 9 10
After roughly 10 minutes you should see the results start to stream across the terminal showing the actors in the dyad, the sum of the Goldstein values,
and the count of events within the dyad. While this is interesting and useful, the ultimate goal is to save this subset to a file so it can
serve as the basis for future analysis. To do so, a temporary holding table is necessary as an intermediate step. The following command creates
such a table, named
1 2 3 4 5 6
You can then issue the same command as above, with the slight modification of inserting the results into the new,
1 2 3 4 5 6 7 8 9 10 11
Now, in order to save the results into a single CSV file, issue the following command. This command creates a
csvexport, that is stored in your S3 bucket, with fields separated by commas and lines ended by
characters. Again, make sure to modify the
LOCATION statement to match a path to your S3 bucket. The second
command writes all of the data from
temp into the
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
There should now be a file in your S3 bucket, within the folder
output, with a strange name. This file contains the output of the query and can be
downloaded to your local computer. This multi-step process is necessary due to the nature of the map/reduce paradigm. Since the table
created by using multiple reducers, the output would be written to a number of files that reflects the number of reducers used in a query. In the case
of this query, that would be 200 separate files. The creation of the
csvexport table allows for the creation of a single text file since no reduce
jobs are needed for the
SELECT * operation.
This tutorial has only scratched the surface of the power of Hive and SQL. It is important to remember that SQL, and by extension HiveQL, is typically used in business-analytical environments. This means that it is very powerful for the type of quick, informative queries that are also useful for social science data such as GDELT. Running a few queries to see what type of data you are working with is very important when working with data on this scale; it is a waste of time to create a subset of the data and run some complicated analysis, only to realize that you have the wrong subset. Thus, Hive should not be treated just as a tool to quickly subset the data, but instead it should become another piece of the analytical workflow to aid in determining what the subset of the data should be in the first place.