Imagine that you are testing for a simple difference in means between two groups of sample observations. You would first perform a t-test. Then, based on the theoretical distribution of the t-statistic, you would be able to determine the probability of a type I error by reading off the corresponding p-value.

But what if you are doing 100 such tests? Or many *thousands*…say one t-test at each voxel (pixel) in an image? The individual p values, one at each voxel, would still get calculated the same way. While these p-values are accurate, they become less useful because in such a case we are typically interested in a different goal: Rather than controlling alpha at the voxel level, we are usually interested in controlling alpha in some manner across the *entire group* of tests. After all, if we do 100 t-tests, each with an alpha threshold of 0.05, by definition we expect to have on average 5 false positives.

Imagine we have 20 degrees of freedom and are focusing on one voxel that had a t-statistic of 3. This voxel is part of a larger group of 100 voxels, which means that we need to somehow control (read: adjust) the p-values we report based on these multiple comparisons.

nVert = 100 # Number of vertices (i.e. # of multiple comparisons) nPerm = 10000 # Number of permutations to perform df = 20 # t-test degrees of freedom testT = 3 # Our real t statistic that we want p-values for thres = 0.05 # Significance threshold t_mat = matrix(rt(nVert * nPerm, df), nPerm, nVert) t_mat = abs(t_mat)

First, we can look at the theoretical distribution of our test statistic, and calculate the theoretical pointwise p-value by determining the fraction of the area under the curve that lies to the right of our example test statistic. Our example t-statistic is shown in red, and the alpha=0.05 threshold is shown as a dotted line. If we were only concerned with this voxel, our p-value would be highly significant.

(p_theo = 2*pt(testT, df, low=F)) dev.new() x = seq(0, 7, len=100) plot(x, 2*dt(x, df), type='l', main='Theoretical') abline(v=qt(thres/2, df, low=F), lty=3) abline(v=testT, col='red')

To get a feel for how we will use permutations to generate *corrected* p-values, we can first use permutation-based methods generate an *uncorrected* p-value that should be equivalent to what we just determined by theory. To do so, you permute the group labels and recalculate the t-statistic at each voxel. This is repeated many (thousands) times, and all of these t-statistics are recorded. Since permuting group membership negates any group effect that was in the data, this process allows us to build up an empirical null distribution of the t-statistic. Similar to above, we then determine the p-value by finding where our example test statistic falls on this histogram.

Note: Instead of going to the trouble of assigning groups and then permuting them, here I just randomly draw from the t distribution.

(p_emp = length(t_mat[t_mat > testT]) / (nVert*nPerm)) crit1 = floor((1-thres)*nVert*nPerm) dev.new() breaks=seq(0,7,len=30) hist(t_mat, breaks, freq=F, main='Empirical') abline(v=sort(t_mat)[crit1], lty=3) abline(v=testT, col='red') title('Empirical')

Finally, we can apply permutation testing to control the family-wise type I error rate - that is, the chance of *any* false positives across *all* of our 100 multiple comparisons. This is done similar to above, but instead of recording all of the test statistics at the end of each round of permutation, only the maximum test statistic across all of the voxels is recorded. This lets us build up an empirical null distribution of the *maximum* test statistic, and comparing our example test statistic to this histogram allows us to generate a p-value that is corrected for multiple comparisons by controlling the family-wise error rate.

maxT_dist = apply(t_mat, 1, max) (p_cor = length(maxT_dist[maxT_dist > testT]) / length(maxT_dist)) crit2 = floor((1-thres)*nPerm) dev.new() hist(maxT_dist, breaks, freq=F, main='Empirical, corrected') abline(v=sort(maxT_dist)[crit2], lty=3) abline(v=testT, col='red') abline(v=qt(thres/(2*nVert), df, low=F), lty=2) # Bonferroni

Notice how the histogram has shifted over to the right. As we expected, this means that any uncorrected p-value derived from an individual t-statistic will be less significant now that we have corrected for the multiple comparisons made by our many t-tests. For our example test statistic, we get:

- Theoretical: p=0.007
- Empirical: p=0.007
- Empirical, corrected: p=0.50

Another way of controlling for multiple comparisons is to use the Bonferroni correction, which simply scales the alpha threshold and p-values according to how many comparisons you are making. The Bonferroni-corrected alpha=0.05 threshold is also shown below with a dashed line. Because the multiple tests in this example are independent, this correction aligns with that of our permutation methods. However, in real data where the tests are often correlated (like neuroimaging data), the Bonferroni correction can give overly-conservative results.

The R code for this example is available here.

<bibtex> @article{nichols_nonparametric_2002,

title = {Nonparametric permutation tests for functional neuroimaging: a primer with examples}, volume = {15}, url = {http://www3.interscience.wiley.com/journal/86010644/abstract?CRETRY=1&SRETRY=0}, number = {1}, journal = {Human brain mapping}, author = {Thomas Nichols and Andrew Holmes}, month = jan, year = {2002}, keywords = {Brain, Brain Mapping, Data Interpretation: Statistical, Humans, Image Processing: {Computer-Assisted,} Magnetic Resonance Imaging, Signal Processing: {Computer-Assisted,} Statistical Distributions, Tomography: {Emission-Computed}}, pages = {1-25}

}

@article{nichols_controlling_2003,

title = {Controlling the familywise error rate in functional neuroimaging: a comparative review}, volume = {12}, url = {http://smm.sagepub.com/cgi/reprint/12/5/419}, number = {5}, journal = {Statistical methods in medical research}, author = {Thomas Nichols and Satoru Hayasaka}, month = oct, year = {2003}, keywords = {Brain, Data Interpretation: Statistical, Humans, Magnetic Resonance Imaging, Models: Statistical, Sensitivity and Specificity, Stochastic Processes}, pages = {419-46}

},

@article{genovese_thresholding_2002,

title = {Thresholding of statistical maps in functional neuroimaging using the false discovery rate}, volume = {15}, url = {http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6WNP-45FSGJF-C&_user=4423&_rdoc=1&_fmt=&_orig=search&_sort=d&view=c&_version=1&_urlVersion=0&_userid=4423&md5=277b05a654d453dc104038d9f9cf960d}, doi = {10.1006/nimg.2001.1037}, number = {4}, journal = {Neuroimage}, author = {Christopher R Genovese and Nicole A Lazar and Thomas Nichols}, month = apr, year = {2002}, keywords = {Adult, Artifacts, Brain Mapping, Cerebral Cortex, Computer Simulation, Female, Humans, Image Processing: {Computer-Assisted,} Magnetic Resonance Imaging, Male, Mathematical Computing, Motor Activity, Reference Values}, pages = {870-8}

}

</bibtex>

Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Noncommercial-Share Alike 4.0 International