Bootstrapped t-test

The code below performs a nonparametric two-sample t-test using bootstrapping. I present the code first, then follow up with a line-by-line description of what's going on.

```python
from __future__ import division
import numpy as np
import pandas as pd
import random

def sample(data):
    sample = [random.choice(data) for _ in xrange(len(data))]
    return sample

def bootstrap_t_test(treatment, control, nboot = 1000, direction = "less"):
    ones = np.vstack((np.ones(len(treatment)),treatment))
    treatment = ones.conj().transpose()
    zeros = np.vstack((np.zeros(len(control)), control))
    control = zeros.conj().transpose()
    Z = np.vstack((treatment, control))
    tstat = np.mean(treatment[:,1])-np.mean(control[:,1])
    tboot = np.zeros(nboot)
    for i in xrange(nboot):
        sboot = sample(Z)
        sboot = pd.DataFrame(np.array(sboot), columns=['treat', 'vals'])
        tboot[i] = np.mean(sboot['vals'][sboot['treat'] == 1]) - np.mean(sboot['vals'][sboot['treat'] == 0]) - tstat
    if direction == "greater":
        pvalue = np.sum(tboot>=tstat-0)/nboot
    elif direction == "less":
        pvalue = np.sum(tboot<=tstat-0)/nboot
    else:
        print 'Enter a valid arg for direction'

    print 'The p-value is %f' % (pvalue)
```

1-4: Just some imports. We need floating-point division from the __future__ module, plus numpy, pandas, and the random module.
6-8: Defines a function, sample, that samples with replacement from the dataset, creating a new dataset of the same length as the original. It uses random.choice, which picks one item at random from the data, and calls it once for each observation in the data.
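To make the resampling step concrete, here is a minimal sketch of the same idea on a tiny list. It is written for Python 3 (range instead of xrange), whereas the listing above targets Python 2, and the data values and the fixed seed are made up purely for illustration.

```python
import random

def sample(data):
    # Draw len(data) items from data, one at a time, with replacement.
    return [random.choice(data) for _ in range(len(data))]

random.seed(0)  # fixed seed only so the illustration is repeatable
data = [2.1, 3.4, 1.9, 5.0, 4.2]  # made-up observations
resampled = sample(data)

# The resample has the same length as the input and only contains values
# from the original data; some may repeat and others may be missing.
print(len(resampled) == len(data))
print(all(x in data for x in resampled))
```

In Python 3.6 and later, random.choices(data, k=len(data)) should give an equivalent resample in a single call.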
10-17: Defines the function that performs the t-test. It takes two data inputs, the number of repetitions to perform, and the direction of the alternative hypothesis. First a 2 x n matrix is built, with row 1 being all ones and row 2 being the treatment data. This is then transposed to give an n x 2 matrix (the .conj() call has no effect on real-valued data). The same procedure is repeated for the control data, with 0s instead of 1s. The two matrices are then stacked on top of each other. tstat is the difference in means between the two groups, and tboot is a vector of zeros with length equal to the number of bootstrap repetitions.
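The matrix construction is easier to follow with small inputs. The sketch below repeats the same steps on made-up data (the values and the names treatment_labeled and control_labeled are illustrative, not part of the original code) and checks that np.column_stack would give the same labelled matrix in one step.

```python
import numpy as np

treatment = np.array([5.1, 4.8, 6.0])  # made-up values
control = np.array([4.0, 4.4])         # made-up values

# Row 0 holds the group label, row 1 the data: shape (2, n).
ones = np.vstack((np.ones(len(treatment)), treatment))
# Transpose to shape (n, 2): one row per observation.
treatment_labeled = ones.transpose()

zeros = np.vstack((np.zeros(len(control)), control))
control_labeled = zeros.transpose()

# Stack both groups into one labelled dataset.
Z = np.vstack((treatment_labeled, control_labeled))

# Observed difference in group means.
tstat = np.mean(treatment_labeled[:, 1]) - np.mean(control_labeled[:, 1])

# column_stack builds the same (n, 2) layout directly.
same = np.array_equal(
    treatment_labeled, np.column_stack((np.ones(len(treatment)), treatment))
)
print(Z.shape, same)
```

Keeping the label in the first column means that each row carries its group membership along when rows are resampled later.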
18-21: This for-loop performs the bootstrap, repeating nboot times. First, a resample of the stacked data (Z) is drawn using the sample function defined above. This is then converted into a pandas DataFrame and given appropriate column names. Finally, the difference in means of the two groups, minus the observed difference tstat, is computed for each iteration and stored in the corresponding position of tboot.
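As a rough illustration of this loop, the sketch below runs it on synthetic data under Python 3. The data, the seed, the group size of 15, and nboot = 200 are all assumptions chosen for the example. Rows are converted to a plain list before resampling, which is a small departure from the listing that avoids passing a NumPy array to random.choice.

```python
import random
import numpy as np
import pandas as pd

random.seed(1)  # fixed seed so the sketch is repeatable

# Synthetic labelled data: column 0 is the group flag, column 1 the value.
treat_vals = np.linspace(4.5, 6.0, 15)
ctrl_vals = np.linspace(3.5, 5.0, 15)
Z = np.vstack((
    np.column_stack((np.ones(15), treat_vals)),
    np.column_stack((np.zeros(15), ctrl_vals)),
))
tstat = treat_vals.mean() - ctrl_vals.mean()

rows = Z.tolist()  # list of [label, value] pairs
nboot = 200
tboot = np.zeros(nboot)
for i in range(nboot):
    # Resample whole rows so each value keeps its group label.
    resampled = [random.choice(rows) for _ in range(len(rows))]
    sboot = pd.DataFrame(np.array(resampled), columns=['treat', 'vals'])
    diff = (sboot.loc[sboot['treat'] == 1, 'vals'].mean()
            - sboot.loc[sboot['treat'] == 0, 'vals'].mean())
    # Store the difference centred on the observed statistic, as in the listing.
    tboot[i] = diff - tstat
```

Because whole rows are resampled, group sizes vary from one replicate to the next. With very small groups a replicate can end up with no rows from one group, in which case its mean is NaN.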
22-28: This simply calculates the proportion of bootstrap values that were greater than or equal to, or less than or equal to, the test statistic, depending on the direction of the alternative hypothesis. The final line (29) then prints the p-value as a float.
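The proportion calculation can be tried out separately from the bootstrap itself. The helper below is a hypothetical Python 3 sketch, not part of the original code: the name bootstrap_p_value and the example values are made up. It returns the p-value instead of printing it and raises an error for an unknown direction, whereas in the listing an invalid direction prints a message and then reaches the final print with pvalue undefined.

```python
import numpy as np

def bootstrap_p_value(tboot, tstat, direction="less"):
    """Proportion of bootstrap values at or beyond tstat in the given direction."""
    tboot = np.asarray(tboot, dtype=float)
    if direction == "greater":
        return np.sum(tboot >= tstat) / len(tboot)
    if direction == "less":
        return np.sum(tboot <= tstat) / len(tboot)
    raise ValueError("direction must be 'greater' or 'less'")

tboot = np.array([-0.4, -0.1, 0.0, 0.2, 0.5])  # made-up bootstrap values
print(bootstrap_p_value(tboot, 0.2, "greater"))  # 2 of 5 values are >= 0.2
print(bootstrap_p_value(tboot, 0.2, "less"))     # 4 of 5 values are <= 0.2
```

Returning the value rather than printing it makes the result easier to reuse, for example when running the test over several pairs of groups.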

John Beieler

PhD Student in Political Science