Confidence intervals for a summary statistic of numeric variables in a data frame.

describe_ci_all extends describe_ci by using map to obtain confidence intervals for a chosen summary statistic for each numeric variable in a data frame, optionally split by any number of grouping variables. As with stat_ci, you can specify any function that operates on a numeric variable and returns a single value (e.g. mean, median, sd, se, etc.).

Confidence intervals for the mean are calculated from the theoretical normal (Gaussian) distribution for speed. For any other statistic, bootstrapping is used, with the option of parallel processing on multicore machines, which can noticeably reduce run time for larger samples. stat_ci may be useful instead of describe_ci if you need to pass additional arguments to the chosen summary statistic function (stat_ci uses its ... argument for this purpose). To get confidence intervals for a single numeric variable in a data frame, use describe_ci instead.
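To make the normal-theory interval for the mean concrete, here is a minimal base R sketch. The helper name mean_ci_normal is hypothetical, and the use of a z quantile from qnorm is an assumption about the general approach rather than a copy of the package's internal code.

```r
# Illustrative sketch only: a normal-theory confidence interval for a mean.
# `mean_ci_normal` is a hypothetical helper, not a function from the package.
mean_ci_normal <- function(x, ci_level = 0.95, na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]          # drop missing values if requested
  n  <- length(x)
  m  <- mean(x)
  se <- sd(x) / sqrt(n)                 # standard error of the mean
  z  <- qnorm(1 - (1 - ci_level) / 2)   # e.g. about 1.96 for a 95% interval
  c(mean = m, lower = m - z * se, upper = m + z * se)
}

set.seed(1)
x <- rnorm(500, mean = 100, sd = 15)    # simulated data for demonstration
ci <- mean_ci_normal(x)
print(ci)
```

Because no resampling is involved, this calculation takes essentially the same time regardless of the replicates setting, which is why the mean is handled separately from other statistics.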

describe_ci_all(
  data,
  ...,
  stat = mean,
  replicates = 2000,
  ci_level = 0.95,
  ci_type = c("perc", "bca", "basic", "norm"),
  parallel = FALSE,
  cores = NULL,
  na.rm = TRUE,
  output = c("dt", "tibble")
)

Arguments

data: A data frame or tibble containing the numeric vectors to be described and any grouping variables ("...").
...: If the data object is a data.frame, this special argument accepts any number of unquoted grouping variable names (which must also be present in the data source) to use for subsetting, separated by commas (e.g. group_var1, group_var2).
stat: The unquoted name (e.g. mean, not "mean") of a summary statistic function to calculate confidence intervals for. Only functions that operate on numeric variables and return a single value are currently supported.
replicates: The number of bootstrap replicates to use to construct confidence intervals for statistics other than the sample mean. Default is 2,000, as recommended by Efron & Tibshirani (1993). For publications, or if you need more precise estimates, more replications (e.g. >= 5,000) are recommended. N.B. more replications will of course take longer to run. If you get the error: "estimated adjustment 'a' is NA" when ci_type is set to "bca" then try again with more replications.
ci_level: The confidence level to use for constructing confidence intervals. Default is set to ci_level = 0.95 for 95 percent CIs.
ci_type: The type of confidence intervals to calculate from the bootstrap samples. Most of the options available in the underlying boot.ci function are implemented (except for studentized intervals): "norm" for an approximation based on the normal distribution, "perc" for percentile, "basic" for basic, and "bca" for bias-corrected and accelerated. Percentile intervals are the default since these are typically sufficient when working with large data sets (e.g. >= 100,000 rows of data) and are faster to calculate than BCa intervals. However, BCa intervals (the default for the more primitive stat_ci function) tend to provide the most accurate/least-biased results (Efron, 1987), particularly for small-medium sized samples, at the obvious cost of requiring more time to calculate. See boot.ci for details.
parallel: Set to TRUE to use multiple cores or FALSE (the default) to process sequentially. Parallel operation involves some processing overhead, so speed gains may not be noticeable for smaller samples, which may even take longer than with sequential processing. Due to the nature of the underlying parallelization architecture, performance gains will likely be greater on non-Windows machines that can use the "multicore" implementation instead of "snow". This option only works on machines with more than one logical processing core.
cores: If parallel is set to TRUE, this determines the number of cores to use. To see how many cores are available on your machine, use parallel::detectCores(). If cores is unspecified, one fewer than the number of available cores will be used by default.
na.rm: Should missing values be removed before attempting to calculate the chosen statistic and confidence intervals? Default is TRUE.
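output: "dt" for a data.table or "tibble" for a tibble. "dt" is listed first in the usage above, and the example output below is printed as a data.table, so "dt" appears to be the default. Choose "tibble" to facilitate subsequent use/modification of the output with the tidyverse collection of packages.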
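The bootstrap arguments above (replicates, ci_level, ci_type) correspond to options of the boot package that the ci_type description refers to. The following is a minimal sketch of a percentile interval for a median computed with boot and boot.ci directly. The statistic wrapper median_stat and the simulated data are illustrative assumptions, and this is not necessarily how describe_ci_all calls these functions internally.

```r
library(boot)  # recommended package distributed with R

# boot() expects a statistic that takes the data and a vector of resampled
# indices; `median_stat` is a hypothetical wrapper written for this sketch.
median_stat <- function(x, idx) median(x[idx])

set.seed(2023)
y <- rexp(200, rate = 1 / 50)  # simulated, right-skewed data

# R plays the role of `replicates`, conf of `ci_level`, type of `ci_type`
boot_out <- boot(data = y, statistic = median_stat, R = 2000)
boot_ci  <- boot.ci(boot_out, conf = 0.95, type = "perc")

# For type = "perc", the last two entries of the `percent` row are the limits
limits <- boot_ci$percent[4:5]
print(limits)
```

Switching type to "bca" in boot.ci requests bias-corrected and accelerated intervals instead, in which case the result is stored in the bca component rather than percent.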

References

Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical Association, 82(397), 171-185.

Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman & Hall.

Author

Craig P. Hutton, craig.hutton@gov.bc.ca

Examples


#using a single core (sequential processing)
if (FALSE) {
describe_ci_all(pdata[1:1000, ], stat = median) #bootstrapped CIs for the median
}
describe_ci_all(pdata, stat = mean) #the default
#>    variable     mean     lower     upper
#> 1:       id 500.5000 495.33483 505.66517
#> 2:       y1 153.7048 152.94037 154.46919
#> 3:       y2 100.0921  99.91106 100.27319
#> 4:       x1  50.4945  49.97677  51.01223
#> 5:       x2 150.6513 150.13560 151.16690
#> 6:       x3 250.4971 249.98063 251.01354
describe_ci_all(pdata, high_low, stat = mean) #split by a grouping variable
#>     variable high_low      mean     lower     upper
#>  1:       id     high 494.33929 487.07751 501.60106
#>  2:       id      low 506.75382 499.40845 514.09919
#>  3:       x1     high  50.49909  49.77409  51.22409
#>  4:       x1      low  50.48984  49.75031  51.22937
#>  5:       x2     high 150.87031 150.14356 151.59706
#>  6:       x2      low 150.42888 149.69711 151.16065
#>  7:       x3     high 250.45806 249.72851 251.18762
#>  8:       x3      low 250.53669 249.80545 251.26793
#>  9:       y1     high 153.74021 152.65785 154.82256
#> 10:       y1      low 153.66882 152.58911 154.74853
#> 11:       y2     high 108.08392 107.93062 108.23722
#> 12:       y2      low  91.97956  91.82262  92.13649

if (FALSE) {
#using multiple cores (parallel processing)
describe_ci_all(pdata[1:1000, ], stat = sd, parallel = TRUE, cores = 2)
}
