The Signal and The Noise
One of the first things that I tell people about GDELT is that it is messy and noisy. This is not meant to be a dig at the dataset; it is just a reality of dealing with data at this scale, generated by automated processes with little to no human supervision. One consequence of this noisiness is a potentially high rate of false positives. Mike Ward and colleagues recently authored a blog post comparing GDELT data against ICEWS data, and one conclusion of the post is that GDELT contains a high number of false positives. The flip side, however, is that GDELT likely has a lower number of false negatives than other datasets; GDELT likely captures more of the events that actually happen.
Reducing the Noise
One feature of GDELT that many people don't know about, or don't understand, is that it uses full-text coding, i.e., an entire news article is coded, versus the lead sentence or paragraph (root) coding used by previous projects. There is some debate about whether full-text coding provides additional, valuable information; for the purposes of this post I assume that it does. The important point, however, is that full-text coding likely adds more noise into the dataset by picking up things such as the background information that is often reported in news articles. For those who aren't familiar, news articles often follow a set format, with the lead paragraph purposefully designed to convey the majority of the story's information. By focusing only on the leads, though, it is possible to miss additional information about related events. In short, there is a tradeoff between full-text and root coding. Before discussing which type of coding is better for various situations, I'd like to provide a bit of information about the empirical distribution of root and non-root events in GDELT.
The GDELT dataset helpfully provides a variable, IsRootEvent, that indicates whether or not a coded event occurs within the lead paragraph of a news story. For the purposes of this analysis, I replicate the data used in the blog post by Mike Ward that was mentioned above (thanks to Matt Dickenson for providing the exact queries used). The data covers 2011-01-01 to 2012-09-03, with protest events pulled for Egypt and Turkey and violent events pulled for Syria. There are 364,492 total events within the dataset. The figure below gives a comparison of the number of root and non-root events contained in the data.
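Tallying the split behind the bar plot comes down to counting values of the IsRootEvent flag. A minimal sketch with pandas, using a small synthetic frame in place of the real tab-delimited GDELT export (IsRootEvent and SQLDATE are column names from the GDELT codebook; the file path in the comment is hypothetical):

```python
import pandas as pd

# Stand-in for the real export; in practice something like:
# df = pd.read_csv("gdelt_subset.tsv", sep="\t")
df = pd.DataFrame({
    "SQLDATE": [20110101, 20110101, 20110102, 20110102, 20110103],
    "IsRootEvent": [1, 0, 1, 1, 0],
})

# Count root (1) vs. non-root (0) events
counts = df["IsRootEvent"].value_counts()
print(counts)
```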
In addition to the above bar plot, the stacked time-series plot below shows the ratio of root to non-root events over time.
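The over-time breakdown behind a plot like this is a grouped count by date and flag. A sketch under the same assumptions as above (toy data standing in for the real file; SQLDATE is GDELT's YYYYMMDD integer date):

```python
import pandas as pd

# Toy stand-in for a GDELT event file
df = pd.DataFrame({
    "SQLDATE": [20110101, 20110101, 20110101, 20110102, 20110102],
    "IsRootEvent": [1, 0, 1, 1, 0],
})
df["date"] = pd.to_datetime(df["SQLDATE"], format="%Y%m%d")

# Daily counts of non-root (column 0) and root (column 1) events,
# ready for a stacked plot
daily = df.groupby(["date", "IsRootEvent"]).size().unstack(fill_value=0)

# Root-to-non-root ratio per day
ratio = daily[1] / daily[0]
```

From here, `daily.plot(kind="area", stacked=True)` would give a stacked time-series view.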
Again, it is important to remember that journalistic practice almost ensures that background information appears in later paragraphs. This means that summary sentences such as "over 10,000 have been killed over the past few months in Syria" tend to show up there, which leads me to believe that some of the spikes seen in non-root events over time are a result of these background sentences. With that said, the body of a news story can also contain events that are not captured in any other story. To reiterate, there is a tradeoff between keeping and ignoring these events.
As a final view, the map below shows non-root events (red) vs. root events (blue). The purple dots are created by overlaying the two event types on top of each other; purple indicates areas that have coverage from both root and non-root events. From even a cursory glance at this map, it seems that neither event type has a monopoly on coverage. The two types of events agree on many of the locations, and there is no systematic pattern in the areas where they disagree. Thus, it seems that one can safely use only the root events without sacrificing a significant amount of geographic coverage.
To get to the heart of the matter: if you're worried about false positives, then filter the GDELT data down to root events. Filtering in this manner shouldn't cause a great loss of information. Note, however, that this filtering won't give you a completely clean, noiseless dataset; the filtered data is just likely to contain less noise. On the flip side, if you're more worried about capturing everything that happened, then make use of the entire dataset.
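In pandas terms, that filter is a single boolean mask on the IsRootEvent column. A sketch, again with toy data standing in for a real GDELT file (GLOBALEVENTID and IsRootEvent are column names from the GDELT codebook):

```python
import pandas as pd

# Toy stand-in for a GDELT event file loaded via pd.read_csv(..., sep="\t")
df = pd.DataFrame({
    "GLOBALEVENTID": [1, 2, 3, 4],
    "IsRootEvent": [1, 1, 0, 1],
})

# Keep only events coded from the lead paragraph
# (lower false-positive risk, at the cost of some coverage)
root_only = df[df["IsRootEvent"] == 1]
print(len(root_only))  # 3 of the 4 toy events survive the filter
```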
Tools, data, etc.