Bootstrap resampling

A sampling distribution shows us what would happen if we took very many samples under the same conditions. The bootstrap is a procedure for finding the (approximate) sampling distribution from just one sample.

The original sample represents the distribution of the population from which it was drawn. Resamples, taken with replacement from the original sample are representative of what we would get from drawing many samples from the population (the distribution of the statistics calculated from each resample is known as the bootstrap distribution of the statistic). The bootstrap distribution of a statistic represents that statistic’s sampling distribution.

TASK Study this cheatsheet and link the relevant sections to each step given above.

Example: constructing bootstrap confidence intervals

Old faithful is a gyser located in Yellowstone National Park, Wyoming. Below is a histogram of the duration of 299 consecutive eruptions. Clearly the distribution is bimodal (has two modes)!

ggplot(data = geyser, aes(x = duration)) +
  geom_histogram() +
  xlab("Duration of eruptions (m)") +
  theme_linedraw()

Step 1: Calculating the observed mean eruption duration time:

mean <- geyser %>%
  summarise(mean = mean(duration))
mean
##       mean
## 1 3.460814

Step 2: Construct bootstrap distribution

## Number of times I want to bootstrap
nreps <- 1000   
## initialize empty array to hold results
bootstrap_means <- numeric(nreps)
set.seed(1234) ## *****Remove this line for actual analyses*****
## This means that each run with produce the same results and
## agree with the printout that I show.
for (i in 1:nreps) {
  ## bootstrap. note with replacement
  bootstrap_sample <- sample(geyser$duration, replace = TRUE)
  ##  bootstrapped mean resample
  bootstrap_means[i] <- mean(bootstrap_sample)
}

Step 3: Collate the sampling distribution

## results
results <- data.frame(bootstrap_means = bootstrap_means)
ggplot(data = results, aes(x = bootstrap_means)) +
  geom_histogram() +
  geom_vline(xintercept = as.numeric(mean)) +
  ggtitle("Sampling distribution of the mean") +
  xlab("Bootstrap means")  + ylab("") + theme_classic()

Step 4: Calculate quantities of interest from the sampling distribution

The bootstrap estimate of bias is the difference between the mean of the bootstrap distribution and the value of the statistic in the original sample:

bias <- as.numeric(mean) - mean(results$bootstrap_means)
bias
## [1] 0.001200111

The bootstrap standard error of a statistic is the standard deviation of its bootstrap distribution:

sd(results$bootstrap_means)
## [1] 0.06740607
## compare to SEM of original data
 MASS::geyser %>%
  summarise(sem = sd(duration)/sqrt(length(duration)))
##          sem
## 1 0.06638498

A Bootstrap \(t\) confidence interval. If, for a sample of size \(n\), the bootstrap distribution is approximately Normal and the estimate of bias is small then an approximate \(C\) confidence for the parameter corresponding to the statistic is:

\[\text{statistic} \pm t^* \text{SE}_\text{bootstrap}\] where \(t*\) is the critical value of the \(t_{n-1}\) distribution with area \(C\) between \(-t^*\) and \(t^*\).

For \(C = 0.95\):

as.numeric(mean) + c(-1,1) * qt(0.975,298)*sd(results$bootstrap_means)
## [1] 3.328162 3.593466

So our 95% confidence interval is 3.3 to 3.6.

A bootstrap \(percentile\) confidence interval. Use the bootstrap distribution itself to determine the limits of the confidence interval by taking the limits of the sorted, central \(C\) bulk of the distribution. For \(C = 0.95\):

sort(results$bootstrap_means)[c(25,975)]
## [1] 3.328428 3.591081

Resampling procedures, the differences

Resampling methods are any of a variety of methods for doing one of the following

Estimating the precision of sample statistics (e.g., bootstrapping)
Performing significance tests (e.g., permutation/exact/randomisation tests)
Validating models (e.g., bootstrapping, cross validation)

Permutation vs bootstrap test

The permutation test exploits symmetry under the null hypothesis.
A full permutation test p-value is exact, conditional on data values in the combined sample.
A bootstrap estimates the probability mechanism that generated the samples under the null hypothesis.
A bootstrap does not rely on any special symmetry or assumption or exchangeability.