
Pulling Large GDELT Data


This quick tutorial shows how to use gdeltPyR to pull GDELT data that is large or covers long periods of time.

Why do we need a tutorial on pulling large GDELT data?

We need a tutorial on pulling large data because a single day's worth of GDELT data can consume 400+ MB of RAM; it's easy to see how a query spanning tens of days (not to mention months or years) will exhaust your RAM.

Required tools for this tutorial

First, let's cover the tools you will need:

concurrent.futures will help us run our gdeltPyR queries in parallel processes. gdeltPyR is how we query the data. And, once our queries are complete, we can use dask to load all the results into an out-of-core dataframe and perform pandas-like operations on the data.
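As a quick sanity check, the imports below cover everything used in this tutorial. If any of them fail, install the missing package with pip (the PyPI package names are gdelt, pandas, and dask; concurrent.futures ships with the Python standard library):

from concurrent.futures import ProcessPoolExecutor  # standard library; runs queries in parallel
import pandas as pd           # date handling and CSV output
import gdelt                  # gdeltPyR itself
import dask.dataframe as dd   # out-of-core dataframe for the combined CSVs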

Plan

Because the data can be large, our plan is to pull one day's worth of GDELT data at a time and write it to disk. dask can then load all of those files from disk into a single dataframe, so we take advantage of that. In short, this lets us operate on data that is larger than RAM (real big data problems).
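To get a feel for the size before committing to a multi-year pull, it can help to grab a single day first. This is just a sketch; the date is arbitrary and the memory figure will vary by day:

import gdelt

# pull one day of GDELT 1.0 events into an ordinary pandas dataframe
gd = gdelt.gdelt(version=1)
single_day = gd.Search('20180421')

# rough size check: rows/columns and in-memory footprint in MB
print(single_day.shape)
print(single_day.memory_usage(deep=True).sum() / 1e6, 'MB')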

Pulling Version 1 GDELT data

Version 2 GDELT data is more extensive, but it only covers February 2015 to the present. Therefore, set the version to 1 if you need data from before February 2015. The gdeltPyR query looks like this:

from concurrent.futures import ProcessPoolExecutor
import pandas as pd
import gdelt

# set up gdeltpyr for version 1
gd = gdelt.gdelt(version=1)

# multiprocess the query
e = ProcessPoolExecutor()

# generic function to pull and write data to disk based on date
def getter(x):
    try:
        date = x.strftime('%Y%m%d')
        d = gd.Search(date)
        d.to_csv("{}_gdeltdata.csv".format(date),encoding='utf-8',index=False)
    except:
        pass

# now pull the data; this will take a long time
results = list(e.map(getter, pd.date_range('2015 Apr 21', '2018 Apr 21')))

Explanation of Code Above

In the first steps, we're just importing the libraries mentioned earlier. The line e = ProcessPoolExecutor() sets up the multiprocessing job so that we can map operations across our CPU cores. If we have 8 cores, we can run 8 gdeltPyR queries simultaneously, so a long query returns faster on a machine with more cores. I suggest using an AWS or Google Compute Engine instance with multiple cores if you want to save time; a machine with 90+ cores can run 90 days' worth of gdeltPyR queries at once. That will save you time!
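If you want to see how many workers you will get, or cap the number of simultaneous queries (for example, to avoid exhausting RAM on a small machine), you can pass max_workers explicitly. This is a minor variation on the setup above:

import os
from concurrent.futures import ProcessPoolExecutor

# ProcessPoolExecutor defaults to one worker per CPU core
print('CPU cores available:', os.cpu_count())

# cap the pool if RAM, rather than CPU, is the bottleneck
e = ProcessPoolExecutor(max_workers=4)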

getter is a utility function that pulls the data and fails gracefully if nothing is returned or the file from the GDELT server is corrupted. It then uses the pandas.DataFrame.to_csv method to export the returned data to a CSV file on disk. Because the line date = x.strftime('%Y%m%d') converts each date to a string, we can reuse that string to build the file name.
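Silently passing on every exception keeps the run going, but it also hides which days failed. A slightly more informative variant of getter (same idea; it assumes the gd object defined above and simply prints the failed dates so you can re-request them later) might look like this:

# assumes gd = gdelt.gdelt(version=1) from the code above
def getter(x):
    date = x.strftime('%Y%m%d')
    try:
        d = gd.Search(date)
        d.to_csv("{}_gdeltdata.csv".format(date), encoding='utf-8', index=False)
    except Exception as err:
        # record the failed date so it can be retried after the run
        print("failed on {}: {}".format(date, err))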

Finally, the line pd.date_range('2015 Apr 21','2018 Apr 21') uses the pandas.date_range function to build an array of dates between the two endpoints.

The results line maps our getter function over each date in that array, and each day's data is written to disk as its own CSV file.
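You can preview the date array before launching the full job; for this three-year span it contains roughly 1,100 dates, which is also how many CSV files the run will produce:

import pandas as pd

dates = pd.date_range('2015 Apr 21', '2018 Apr 21')
print(len(dates))           # number of days (and output files) in the pull
print(dates[0], dates[-1])  # first and last dates, as a sanity check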

Pulling Version 2 GDELT data

If you need version 2 data (that is, data from February 2015 onward), you can modify the code above slightly. Specifically, pass version=2 when creating the gdelt object and add coverage=True to the gdelt.gdelt.Search call. The modified code looks like this:

from concurrent.futures import ProcessPoolExecutor
import pandas as pd
import gdelt

# set up gdeltpyr for version 2
gd = gdelt.gdelt(version=2)

# multiprocess the query
e = ProcessPoolExecutor()

# generic function to pull and write data to disk based on date
def getter(x):
    try:
        date = x.strftime('%Y%m%d')
        d = gd.Search(date, coverage=True)
        d.to_csv("{}_gdeltdata.csv".format(date),encoding='utf-8',index=False)
    except:
        pass

# now pull the data; this will take a long time
results = list(e.map(getter, pd.date_range('2015 Apr 21', '2018 Apr 21')))

Analyzing Big gdeltPyR data

If you need to write the data to disk because it's larger than RAM, you'll also want a convenient workflow for analyzing data that is larger than RAM. dask makes it possible to handle inconveniently large data sets, whether they live on a single laptop or are spread across hundreds or thousands of machines in a cluster. Here we use dask to load the multiple CSV files produced by our gdeltPyR queries into a single dataframe, and we can perform pandas-like operations on this larger data set as well. The code to load the data:

import dask.dataframe as dd

# read all the gdelt csvs into one dataframe
df = dd.read_csv('*_gdeltdata.csv')

That's it! Whether your target folder holds hundreds or thousands of CSV files, this one line will read them all.
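From here you can run familiar pandas-style operations; dask builds a lazy task graph and only does the work when you call .compute(). The example below is just an illustration, and the 'EventCode' column name is an assumption about the GDELT event fields in your pull; check df.columns for what you actually have:

# operations on the dask dataframe are lazy until .compute() is called
print(df.columns)   # inspect the available GDELT fields

# example aggregation; substitute any column from df.columns
counts = df['EventCode'].value_counts().compute()
print(counts.head())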

Conclusion

This wiki provided a tutorial on how to pull and process large amounts of GDELT data using gdeltPyR. To learn more about dask, visit its documentation at https://docs.dask.org.