John Beieler

PhD Student in Political Science

Using Hive With Social Science Data

Processing Data With Hive

Working with big data is not a particularly easy task. There is a lot of commentary on the web about what constitutes “big” data. The GDELT dataset, which is the focus of this post, is over 40 gigabytes uncompressed, so this is not a discussion of “Google-size” data. It is, however, more than most social scientists are used to dealing with in one pass. I’ve attempted to chronicle my history of working with the GDELT dataset to draw interesting conclusions about the world using event data. I’ve been relatively successful so far, but I felt that it was possible to make the data easier to work with. Towards this end, I began to explore SQL and database technologies to use as a subsetting method. I finally landed on Hive, which is part of the Hadoop ecosystem. Hive allows you to run SQL queries (Hive’s language is actually called HiveQL) on top of the map/reduce framework for computation. The data is distributed (mapped) across multiple nodes in a server cluster, the query runs against each of these pieces of the data in parallel, and the results are then recombined (reduced) into a single output.

Using Hive, it is possible to run fairly complex queries across the entirety of the GDELT dataset in roughly five minutes. This speed is possible thanks to Amazon’s Elastic MapReduce environment, which uses Elastic Compute Cloud (EC2) resources as the computational backend. EC2 makes it cheap and easy to rent a large cluster of servers; as an example, I have used 40 servers at ~$0.10/hr per server. Thus, the combination of Hive and Amazon Web Services makes it remarkably easy to get up and running with quick queries over this very interesting dataset. The rest of this post shows you how.

Preparing The Data

Before doing any actual analyses with the data, it must be loaded into the Amazon Web Services (AWS) environment. The primary service for storing data on AWS is the Simple Storage Service (S3). To upload data to S3, first sign in to your AWS account and head to the AWS Management Console. Once at the console, select the S3 service, as seen below.

Within the S3 console, you should create a new bucket by clicking the “Create Bucket” button at the top left of the screen. The bucket name must be globally unique, i.e., a name that does not appear anywhere else in the S3 system. Your name is likely a good choice. Once you have selected a bucket name, select the “Create” option rather than “Set Up Logging >”. The next step is to navigate to the created bucket and create a new folder; I chose “gdelt” as the folder name.

You should navigate to this new folder and begin uploading the GDELT data. For the purposes of this tutorial, only a couple of years are necessary. I recommend downloading the 1979 and 1980 data and uploading only these years to the <YOURNAME>/gdelt folder on S3 as a start. As a recap: download the desired GDELT data and upload those files to the S3 bucket and folder that you created. The Hive queries used later in this post will load all of the data within a folder into the table, so make sure that only the GDELT data is contained in the bucket/folder path.

Loading the entire dataset onto S3 is a lengthy process. I have the entire dataset uploaded, along with continual, daily updates, and I am planning on making this bucket public, but a lot depends on the potential costs. I will have more to say about this as I find out more information.

Setting Up Hive

With the data uploaded to Amazon, the next step is setting up and using Hive. First, return to the AWS Management Console and select the Elastic MapReduce (EMR) service. Once in the EMR console, select “Create New Job Flow”, seen in the top left-hand corner of the EMR console. Once the job flow dialog appears, give the job a name (I chose gdelt_query) and select the job type, which should be “Hive Program”.

On the next screen, you should simply choose the “Start an Interactive Hive Session” radio button.

The next screen involves the selection of EC2 instance types to use as the computational power for the map/reduce tasks. For the purposes of this post, I’ll make use of 10 instances of the small server type for the core instance group and another small instance for the master instance. These instances cost $0.06/hr to run, with Amazon rounding up to the nearest hour for any job. The queries used in this tutorial should take less than an hour, which means that the total cost will come to roughly $0.70.

The only change needed on the next dialog screen is the selection of a key pair that is used to secure shell (SSH) into the master node. Creating an EC2 Key Pair is beyond the scope of this tutorial, but Amazon has a tutorial that should provide all of the necessary information; “Have Amazon EC2 generate it for you” is the option I suggest. You should make sure to give the pair a distinctive name and download it to a memorable location on your computer. The most common place to store SSH keys is in a ~/.ssh directory. The next dialog box does not have any options that need to be changed; no bootstrap actions are required for our purposes. The final screen shows a review of the job flow that will be created. Make sure everything looks good and then click “Create Job Flow.” Following this, the job flow should appear in the EMR console. It will likely take a few minutes for everything to warm up.

Elastic MapReduce Command-Line Interface

In order to communicate with the EMR cluster you have created, some additional tools are necessary; specifically, the EMR command-line interface (CLI). Amazon has a pretty good tutorial on this as well. The only additional comments I will provide are to set the “region” argument in the credentials file to “us-east-1”, and that the “log_uri” should be the bucket created earlier, which should look like "s3://<YOURNAME>/" if you used your name to create the bucket.

I do know, however, that I had some issues installing the CLI, since I have Ruby version 1.9.3 installed on my computer. I do not know if this is still an issue, but if you have problems with the CLI and have a version of Ruby other than 1.8, a different implementation of the CLI exists at https://github.com/tc/elastic-mapreduce-ruby. You should download this using the ZIP download option towards the top of the screen. The installation and usage for this version of the CLI are the same as for the 1.8-compatible version. The other option is to revert to Ruby 1.8.7.

With the CLI downloaded and set up, you should cd to its directory and issue the ./elastic-mapreduce --list command. This should show you the history of any job flows you created, along with the job IDs. The job ID is the first field for each job flow and looks something like j-ABC123DEFG45. This ID is used to SSH into the master node for the job, using either the ./elastic-mapreduce -j j-ABC123DEFG45 --ssh or the ./elastic-mapreduce -ssh j-ABC123DEFG45 command, with the latter being the command to use if you are using the version of the CLI available from the GitHub link above. With this, you should finally have an EMR instance up and running. The next section shows how to use this instance to query the GDELT data using Hive.

Querying The Data

Once you have SSH’ed into the instance, you should issue the command hive at the prompt. As a note, I will not cover the finer points of SQL in this post. SQL is easy to learn, and numerous tutorials exist that can help you learn it; I suggest Zed Shaw’s Learn SQL The Hard Way. This section will only cover creating the table that stores the GDELT data, running some basic queries, and saving the output to a CSV file.

In order to create the table for GDELT, the following command should be issued in the Hive interpreter. Copy and paste should work fine, except for the final line, which should be modified to match the path to the S3 bucket to which the data was uploaded, e.g., the <YOURNAME>/gdelt format.

CREATE EXTERNAL TABLE IF NOT EXISTS gdelt (
 GLOBALEVENTID BIGINT,
 SQLDATE INT,
 MonthYear INT,
 Year INT,
 FractionDate DOUBLE,
 Actor1Code STRING,
 Actor1Name STRING,
 Actor1CountryCode STRING,
 Actor1KnownGroupCode STRING,
 Actor1EthnicCode STRING,
 Actor1Religion1Code STRING,
 Actor1Religion2Code STRING,
 Actor1Type1Code STRING,
 Actor1Type2Code STRING,
 Actor1Type3Code STRING,
 Actor2Code STRING,
 Actor2Name STRING,
 Actor2CountryCode STRING,
 Actor2KnownGroupCode STRING,
 Actor2EthnicCode STRING,
 Actor2Religion1Code STRING,
 Actor2Religion2Code STRING,
 Actor2Type1Code STRING,
 Actor2Type2Code STRING,
 Actor2Type3Code STRING,
 IsRootEvent INT,
 EventCode STRING,
 EventBaseCode STRING,
 EventRootCode STRING,
 QuadClass INT,
 GoldsteinScale DOUBLE,
 NumMentions INT,
 NumSources INT,
 NumArticles INT,
 AvgTone DOUBLE,
 Actor1Geo_Type INT,
 Actor1Geo_FullName STRING,
 Actor1Geo_CountryCode STRING,
 Actor1Geo_ADM1Code STRING,
 Actor1Geo_Lat FLOAT,
 Actor1Geo_Long FLOAT,
 Actor1Geo_FeatureID INT,
 Actor2Geo_Type INT,
 Actor2Geo_FullName STRING,
 Actor2Geo_CountryCode STRING,
 Actor2Geo_ADM1Code STRING,
 Actor2Geo_Lat FLOAT,
 Actor2Geo_Long FLOAT,
 Actor2Geo_FeatureID INT,
 ActionGeo_Type INT,
 ActionGeo_FullName STRING,
 ActionGeo_CountryCode STRING,
 ActionGeo_ADM1Code STRING,
 ActionGeo_Lat FLOAT,
 ActionGeo_Long FLOAT,
 ActionGeo_FeatureID INT,
 DATEADDED INT,
 SOURCEURL STRING )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3n://<YOURNAME>/gdelt' ;

This creates a table, named gdelt, with fields for each of the columns available in the GDELT dataset. It is now possible to run some simple queries against the data. Before doing so, however, it is useful to adjust the number of reducers used in the map/reduce jobs, since this number is automatically determined from the size of the input; upping the number of reducers can speed up the queries. To do this, enter set hive.exec.reducers.max=200; set mapred.reduce.tasks=200; into the Hive prompt.
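For easier copying and pasting, here are those two settings exactly as they are entered at the Hive prompt:

set hive.exec.reducers.max=200;
set mapred.reduce.tasks=200;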

The first example query used in this post is the same as in my previous post: select all events that occurred within Syria, show the dyadic interactions, and create a sum of the Goldstein values for each dyad. The data is drawn from the gdelt table created using the previous command. The code for the query is exactly the same as in the previous post:

SELECT
Year, Actor1Code, Actor2Code, sum(GoldsteinScale), count(Actor1Code)
FROM
gdelt
WHERE
Actor1CountryCode == 'SYR'
AND
Actor2CountryCode == 'SYR'
GROUP BY
Year, Actor1Code, Actor2Code;

After roughly 10 minutes you should see the results start to stream across the terminal, showing the year, the actors in the dyad, the sum of the Goldstein values, and the count of events within the dyad. While this is interesting and useful, the ultimate goal is to save this subset to a file so it can serve as the basis for future analysis. To do so, a temporary holding table is necessary as an intermediate step. The following command creates such a table, named temp.

CREATE TABLE temp (
 Actor1Code STRING,
 Actor2Code STRING,
 SumGoldstein FLOAT,
 EventCount INT
 );

You can then issue nearly the same query as above, this time dropping Year and inserting the results into the new temp table.

INSERT OVERWRITE TABLE temp
SELECT
Actor1Code, Actor2Code, sum(GoldsteinScale), count(Actor1Code)
FROM
gdelt
WHERE
Actor1CountryCode == 'SYR'
AND
Actor2CountryCode == 'SYR'
GROUP BY
Actor1Code, Actor2Code;

Now, in order to save the results into a single CSV file, issue the following commands. The first creates a table, csvexport, that is stored in your S3 bucket, with fields separated by commas and lines terminated by \n characters. Again, make sure to modify the LOCATION statement to match a path to your S3 bucket. The second command writes all of the data from temp into the csvexport table.

CREATE TABLE csvexport (
 Actor1Code STRING,
 Actor2Code STRING,
 SumGoldstein FLOAT,
 EventCount INT
 )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 's3n://<YOURNAME>/output';

INSERT OVERWRITE TABLE csvexport
SELECT
*
FROM
temp;

There should now be a file in your S3 bucket, within the output folder, with a strange name. This file contains the output of the query and can be downloaded to your local computer. This multi-step process is necessary due to the nature of the map/reduce paradigm. Since the table temp is created using multiple reducers, its output is written to as many files as there were reducers used in the query; in the case of this query, that would be 200 separate files. The creation of the csvexport table allows for the creation of a single text file since no reduce jobs are needed for the SELECT * operation.
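If you want a quick sanity check on the export before downloading it (this step is optional and not part of the workflow above), you can peek at the first few rows from within Hive:

SELECT *
FROM csvexport
LIMIT 10;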

Wrapping Up

This tutorial has only scratched the surface of the power of Hive and SQL. It is important to remember that SQL, and by extension HiveQL, is typically used in business-analytical environments. This means that it is very powerful for the type of quick, informative queries that are also useful for social science data such as GDELT. Running a few queries to see what type of data you are working with is very important when working with data on this scale; it is a waste of time to create a subset of the data and run some complicated analysis, only to realize that you have the wrong subset. Thus, Hive should not be treated just as a tool to quickly subset the data, but instead it should become another piece of the analytical workflow to aid in determining what the subset of the data should be in the first place.
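As a concrete illustration of the sort of quick, exploratory query I have in mind, here is a sketch (not part of the tutorial above, just an example against the gdelt table created earlier) that counts events per year, which is a cheap way to get a feel for the shape of the data before committing to a particular subset:

SELECT Year, count(*) AS NumEvents
FROM gdelt
GROUP BY Year;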