Parallel Data Subsetting

The Challenge

I’ve been working with some data that is spread over multiple tab-delimited files and is rather large (on the order of 20-30 GB). The task is to comb through the data and extract observations (rows) that match certain characteristics. Specifically, each file is read line by line, and if a particular field in a line matches a given value, the line is extracted and appended to the final output. This is relatively straightforward in Python, but iterating over all of the files takes around 45 minutes. While this isn’t an exorbitant amount of time, faster is always better. The problem is also embarrassingly parallel: the files do not need to communicate their results to each other; the results simply need to be stacked into a final array and saved to a file. Thus began my saga to implement a parallel version of the script I wrote to iterate over the files and select the lines.

The Code

import numpy as np
from joblib import Parallel, delayed

def subset(file):
    dataOut = []
    with open(file, 'r') as data:
        data.readline()  # skip the header row of column names
        for line in data:
            splitLine = line.rstrip('\n').split('\t')
            if splitLine[3] == '57':  # keep rows whose fourth field is '57'
                dataOut.append(splitLine)

    return dataOut

def stack(list_of_data, hold_data):
    hold = np.array(hold_data)  # start from the header row
    for current in filter(None, list_of_data):  # skip files with no matches
        hold = np.vstack((hold, np.array(current)))
    return hold


if __name__ == "__main__":
    filepath = ['testData1.txt', 'testData2.txt']
    hold = []
    with open(filepath[0], 'r') as temp:
        hold.append(temp.readline().rstrip('\n').split('\t'))
    data = Parallel(n_jobs=-1)(delayed(subset)(x) for x in filepath)
    finalData = stack(data, hold)

As with my other posts, I’ll walk through the code line by line.

1-2 Just imports. I’ll be using the Parallel and delayed functions from the joblib module.
4-13 Defining the function that subsets the data. The code is fairly easy to understand here. The file is opened and the first line is read and discarded, since it contains the column names and should not be appended to the result. Then each remaining line is split on tabs and appended to the output if it meets the criterion, here that the fourth field equals '57'.
15-19 Function to stack the data. The joblib Parallel call returns a list of lists, where each inner list holds the matching rows from one file. The stack function iterates over that list, converts each inner list to a numpy array, and stacks it onto the data accumulated so far.
22-28 Running the script. Lines 23-26 read the header row from the first file so that the column names sit at the top of the output. The main focus here is on line 27. Parallel is first constructed with the number of jobs to use; setting n_jobs to -1 says to use all available cores. The resulting object is then called with a generator of the tasks to be run in parallel. The joblib docs indicate that “The delayed function is a simple trick to be able to create a tuple (function, args, kwargs) with a function-call syntax.” So delayed wraps the subset function, which is then given the arg x, representing one file from filepath. The results come back as a list of lists, one inner list per file, and are stacked using the stack function.
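To make the quote about delayed a little more concrete, here is a rough sketch of the idea using only the standard library. The names toy_delayed and subset_stub are made up for illustration; this is a simplified stand-in and not joblib's actual implementation, which also handles dispatching the work to separate processes.

```python
import functools


def toy_delayed(function):
    """Simplified stand-in for joblib.delayed (illustration only)."""
    @functools.wraps(function)
    def capture(*args, **kwargs):
        # Record the call as a tuple instead of running it right away.
        return function, args, kwargs
    return capture


def subset_stub(path, value='57'):
    # Hypothetical placeholder for the real subset function.
    return "rows of %s where field == %s" % (path, value)


# Building the task list does not call subset_stub yet.
tasks = [toy_delayed(subset_stub)(x) for x in ['testData1.txt', 'testData2.txt']]

# A serial "executor": unpack each tuple and make the call.
results = [func(*args, **kwargs) for func, args, kwargs in tasks]
```

Each entry in tasks is just a (function, args, kwargs) tuple, which is what lets Parallel decide when and where each call actually runs.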

Some quick preliminary tests show that this parallel implementation is much, much faster than running in serial. Running on two files is almost instantaneous, which is a drastic improvement.
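One way to put a number on the comparison is to time both versions on the same inputs. The following is a minimal sketch of a serial baseline, assuming small made-up sample files; the file names, the column layout, and the timed and write_sample helpers are all hypothetical and only there to keep the example self-contained.

```python
import os
import tempfile
import time


def subset(path, column=3, value='57'):
    """Return the tab-split rows of `path` whose `column` field equals `value`."""
    rows = []
    with open(path, 'r') as handle:
        handle.readline()  # skip the header row
        for line in handle:
            fields = line.rstrip('\n').split('\t')
            if fields[column] == value:
                rows.append(fields)
    return rows


def timed(func, *args):
    """Run func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def write_sample(directory, name, codes):
    # Hypothetical test data: four columns, with the code in the fourth.
    path = os.path.join(directory, name)
    with open(path, 'w') as handle:
        handle.write('id\ta\tb\tcode\n')
        for i, code in enumerate(codes):
            handle.write('%d\tx\ty\t%s\n' % (i, code))
    return path


with tempfile.TemporaryDirectory() as tmp:
    paths = [write_sample(tmp, 'sample1.txt', ['57', '12', '57']),
             write_sample(tmp, 'sample2.txt', ['99', '57'])]
    # Time the plain serial loop over all files.
    serial, seconds = timed(lambda: [subset(p) for p in paths])
    print('serial run: %d matching rows in %.4f s'
          % (sum(len(rows) for rows in serial), seconds))
```

The same timed helper could wrap the Parallel call to get the parallel figure. On files this small, the cost of starting worker processes will likely outweigh any gain, so a meaningful comparison needs inputs closer to the real data in size.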

John Beieler

PhD Student in Political Science
