The Challenge
I’ve been working with some data that is spread over multiple tab-delimited files and is rather large (on the order of 20-30 GB). The task has been to comb through the data and extract observations (rows) that match certain characteristics. Specifically, each file is iterated over, and if a field within a line matches a given value, the line is extracted and appended to the final output. This task is relatively straightforward in Python, but iterating over all of the files takes around 45 minutes. While this isn’t an exorbitant amount of time, faster is always better. The problem is also embarrassingly parallel: the different files do not need to communicate their results to each other, and the results simply need to be stacked into a final array to be saved to a file. Thus began my saga to implement a parallel version of a script I wrote to iterate over the files and select the lines.
The Code
As with my other posts I’ll walk through the code line by line.
1-2 Just imports. I’ll be using Parallel and delayed from the joblib module.
4-13 Defining the function to subset out the data. The code is fairly easy to understand here. The file is opened and the first line is read on its own, since it contains the column names and should not be appended to the result. Then each remaining line is iterated over, split on the tabs, and appended if it meets a certain criterion.
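As a rough illustration of the subsetting step, here is a minimal sketch of what such a function might look like. The column and value parameters, along with their defaults, are hypothetical placeholders of mine; the actual field and target value depend on the data.

```python
def subset(filepath, column=0, value="target"):
    """Return the rows of a tab-delimited file whose `column` field equals `value`.

    `column` and `value` are illustrative placeholders, not the real criteria.
    """
    result = []
    with open(filepath) as f:
        f.readline()  # skip the header row holding the column names
        for line in f:
            fields = line.rstrip("\n").split("\t")
            # Keep the row only if the field of interest matches.
            if len(fields) > column and fields[column] == value:
                result.append(fields)
    return result
```

Because the file is read one line at a time, memory use is driven by the number of matching rows rather than by the size of the file.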
15-19 Function to stack the data. The joblib Parallel call will return a list of lists, with each inner list holding the results from one file. The stack function iterates over the list, converts the inner lists to numpy arrays, and stacks the current data with the previous data.
22-28 Running the script. The main focus here is on line 27. The first argument that Parallel takes is the number of jobs to be used. Setting n_jobs to -1 says to use all available cores. The resulting Parallel object is then called with the work to be run in parallel, built using delayed. The joblib docs indicate that “The delayed function is a simple trick to be able to create a tuple (function, args, kwargs) with a function-call syntax.” So delayed is passed the subset function with the arg x, which represents the file to be opened as held in filepath. This data is then stored as a list of lists, and stacked using the stack function.
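To make the stacking and the parallel call concrete, here is a minimal sketch under a few assumptions of mine: run_parallel is a hypothetical wrapper name, a single np.vstack call stands in for the stack-as-you-go loop described above, and files with no matching rows are skipped so the arrays line up.

```python
import numpy as np
from joblib import Parallel, delayed


def stack(data):
    """Stack a list of per-file row lists into one 2-D numpy array."""
    # Skip files that produced no matching rows; an empty list would
    # otherwise become a 1-D array that cannot be stacked with the rest.
    arrays = [np.array(rows) for rows in data if len(rows) > 0]
    if not arrays:
        return np.empty((0, 0), dtype=str)
    return np.vstack(arrays)


def run_parallel(worker, filepaths, n_jobs=-1):
    """Apply `worker` to every path in parallel and stack the results.

    `run_parallel` is a hypothetical helper; n_jobs=-1 asks joblib to use
    all available cores.
    """
    # delayed(worker)(x) packages the call as (function, args, kwargs)
    # instead of running it immediately; Parallel then dispatches each one.
    data = Parallel(n_jobs=n_jobs)(delayed(worker)(x) for x in filepaths)
    return stack(data)
```

With a subset function and a list of paths in filepath, the call would be something like run_parallel(subset, filepath). Parallel returns the results in the same order as the inputs, so the stacked rows follow the order of the files.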
Some quick preliminary tests show that this parallel implementation is much, much faster than running in serial. Running on two files is almost instantaneous, which is a drastic improvement.