This function extends describe by applying it to all columns of the specified class(es) in a data frame, using functional programming tools from the purrr package (e.g. map). To obtain a summary of a single variable in a data frame, use describe instead.

describe_all(
  data,
  ...,
  class = "all",
  digits = 3,
  type = 2,
  na.rm = TRUE,
  sep = "_",
  output = c("dt", "tibble")
)

Arguments

data

A data frame or tibble.

...

This special argument accepts any number of unquoted grouping variable names (also present in the data source) to use for subsetting, separated by commas, e.g. group_var1, group_var2. Also accepts a character vector of column names or index numbers, e.g. c("group_var1", "group_var2") or c(1, 2), but not a mixture of formats in the same call. If no column names are specified, all columns will be used.

class

The variable classes in data that you would like summaries for. Either "all" for all classes, or a character vector indicating which combination of classes you want. Specifying a subset saves time, since summaries are only computed for the requested classes. Options include "d" for dates, "f" for factors, "c" for character, "l" for logical, and "n" for numeric. If only a single class is requested or present in the data after excluding the specified grouping variables, a data frame is returned; otherwise you get a list of data frames (one per summary class). If the only chosen class of variables is not detected in the input data, an error is returned indicating that the class argument needs to be respecified.

digits

This determines the number of digits used for rounding of numeric outputs.

type

For numeric and integer vectors this determines the type of skewness and kurtosis calculations to perform. See skewness or skew and kurtosis or kurtosi for details.

na.rm

This determines whether missing values (NAs) should be removed before attempting to calculate summary statistics.

sep

A character string to use to separate unique values from their counts ("_" by default). Only applicable to factors and character vectors.

output

Output type for each class of variables. "dt" for data.table or "tibble" for tibble.

Value

The output varies as a function of the class of input data/y, referred to as "y" below. Each output type is grouped together in a data frame and returned as a named item of a list, unless there is only one output type, in which case the data frame is returned directly.
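
The grouping of columns by class described above can be pictured with a small base R sketch. This is not the package's implementation: class_code is a hypothetical helper, and treating only Date columns as "d" is an assumption.

```r
# Hypothetical helper (not part of the package): map a column to one of the
# class codes accepted by the `class` argument.
class_code <- function(y) {
  if (inherits(y, "Date")) {
    "d" # assumption: only Date columns count as dates here
  } else if (is.factor(y)) {
    "f"
  } else if (is.character(y)) {
    "c"
  } else if (is.logical(y)) {
    "l"
  } else if (is.numeric(y)) {
    "n"
  } else {
    NA_character_
  }
}

# Made-up example data with one column of each class
df <- data.frame(
  when = as.Date(c("2021-01-01", "2021-01-02")),
  grp = factor(c("a", "b")),
  id = c("x1", "x2"),
  flag = c(TRUE, FALSE),
  score = c(1.5, 2.5),
  stringsAsFactors = FALSE
)

# One class code per column, then column names grouped by class code
codes <- vapply(df, class_code, character(1))
cols_by_class <- split(names(df), codes)

# A data frame whose columns all share one class yields a single group,
# which corresponds to the case where one data frame is returned directly.
mtcars_codes <- unique(vapply(mtcars, class_code, character(1)))
```

Here cols_by_class has one element per class found in df, while mtcars_codes contains only "n", which is consistent with describe_all(mtcars) returning a single table in the Examples.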

For all input variables, the following are returned (part 1):

cases

the total number of cases

n

number of complete cases

na

the number of missing values

p_na

the proportion of total cases with missing values
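
As a rough illustration of the part 1 measures, the following base R sketch computes them for a single made-up vector. It only mirrors the definitions listed above and is not the package's own code.

```r
# Made-up vector with two missing values
y <- c(4.2, NA, 3.1, 5.0, NA)

cases <- length(y)      # total number of cases
n <- sum(!is.na(y))     # number of complete (non-missing) cases
na <- sum(is.na(y))     # number of missing values
p_na <- na / cases      # proportion of total cases that are missing

part1 <- data.frame(cases = cases, n = n, na = na, p_na = p_na)
```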

In addition to part 1, these measures are provided for dates:

n_unique

the total number of unique values or levels of y. For dates this tells you how many time points there are

start

the earliest or minimum date in y

end

the latest or maximum date in y
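
A minimal base R sketch of the date measures, using made-up dates. Excluding missing values from n_unique is an assumption made for this illustration.

```r
# Made-up dates with one repeat and one missing value
y <- as.Date(c("2020-01-01", "2020-03-15", "2020-01-01", NA))

y_obs <- y[!is.na(y)]              # assumption: NAs are not counted as a value
n_unique <- length(unique(y_obs))  # number of distinct time points
start <- min(y_obs)                # earliest date
end <- max(y_obs)                  # latest date
```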

In addition to part 1, these measures are provided for factors:

n_unique

the total number of unique values or levels of y

ordered

a logical indicating whether or not y is ordinal

counts_tb

the counts of the top and bottom unique values of y in order of decreasing frequency formatted as "value_count". If there are more than 4 unique values of y, only the top 2 and bottom 2 unique values are shown separated by "...". To get counts for all unique values use counts or counts_tb instead.
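
The "value_count" layout described for counts_tb can be sketched in base R as follows. counts_tb_sketch is a hypothetical helper written for this illustration, and joining the entries with ", " is an assumption about the exact formatting.

```r
# Hypothetical helper (not the package's counts_tb function): build a
# "value<sep>count" string, keeping only the top k and bottom k values when
# there are more than 2 * k unique values.
counts_tb_sketch <- function(y, sep = "_", k = 2) {
  tab <- sort(table(y), decreasing = TRUE)
  labels <- paste(names(tab), as.integer(tab), sep = sep)
  if (length(labels) > 2 * k) {
    labels <- c(head(labels, k), "...", tail(labels, k))
  }
  paste(labels, collapse = ", ") # assumption: entries joined by ", "
}

# Five unique values with distinct counts, so the ordering is unambiguous
y <- factor(c(rep("a", 5), rep("b", 4), rep("c", 3), rep("d", 2), "e"))
many <- counts_tb_sketch(y)

# Four or fewer unique values: everything is shown
few <- counts_tb_sketch(c("x", "x", "y"), sep = ":")
```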

In addition to part 1, these measures are provided for character/string vectors:

n_unique

the total number of unique values or levels of y

min_chars

the minimum number of characters in the values of y

max_chars

the maximum number of characters in the values of y

counts_tb

the counts of the top and bottom unique values of y in order of decreasing frequency formatted as "value_count". If there are more than 4 unique values of y, only the top 2 and bottom 2 unique values are shown separated by "...". To get counts for all unique values use counts or counts_tb instead.
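
The character-specific measures can be approximated in base R as shown below. The input is made up, and dropping missing values before counting characters is an assumption.

```r
# Made-up character vector with one missing value and one repeat
y <- c("apple", "fig", NA, "banana", "fig")

y_obs <- y[!is.na(y)]              # assumption: NAs are excluded first
n_unique <- length(unique(y_obs))  # number of distinct strings
lens <- nchar(y_obs)               # characters per string
min_chars <- min(lens)
max_chars <- max(lens)
```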

In addition to part 1, these measures are provided for logical vectors:

n_TRUE

the total number of y values that are TRUE

n_FALSE

the total number of y values that are FALSE

p_TRUE

the proportion of y values that are TRUE
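
A short base R sketch of the logical measures on a made-up vector. Whether p_TRUE is computed over all cases or only over non-missing values is not stated above; this sketch assumes non-missing values.

```r
# Made-up logical vector with one missing value
y <- c(TRUE, FALSE, TRUE, NA, TRUE)

n_TRUE <- sum(y, na.rm = TRUE)     # count of TRUE values
n_FALSE <- sum(!y, na.rm = TRUE)   # count of FALSE values
# assumption: denominator is the number of non-missing values
p_TRUE <- n_TRUE / sum(!is.na(y))
```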

In addition to part 1, these measures are provided for numeric variables:

mean

the mean of y

sd

the standard deviation of y

se

the standard error of the mean of y

p0

the 0th percentile (the minimum) of y

p25

the 25th percentile of y

p50

the 50th percentile (the median) of y

p75

the 75th percentile of y

p100

the 100th percentile (the maximum) of y

skew

the skewness of the distribution of y

kurt

the kurtosis of the distribution of y
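
To make the numeric measures concrete, here is a base R sketch for mtcars$mpg. It assumes the default quantile algorithm and the "type 2" skewness and kurtosis formulas as documented for e1071::skewness and e1071::kurtosis; the package may compute these differently. Under those assumptions the rounded values should agree with the mpg row of the Examples output.

```r
y <- mtcars$mpg
y <- y[!is.na(y)]   # mirrors na.rm = TRUE

n <- length(y)
m <- mean(y)
s <- sd(y)
se <- s / sqrt(n)   # standard error of the mean

# p0, p25, p50, p75, p100 using R's default quantile type
pcts <- quantile(y, probs = c(0, 0.25, 0.5, 0.75, 1), names = FALSE)

# Central moments used by the skewness and kurtosis formulas
dev <- y - m
m2 <- mean(dev^2)
m3 <- mean(dev^3)
m4 <- mean(dev^4)
g1 <- m3 / m2^1.5    # "type 1" skewness
g2 <- m4 / m2^2 - 3  # "type 1" excess kurtosis

# "Type 2" adjustments (assumed to correspond to type = 2)
skew <- g1 * sqrt(n * (n - 1)) / (n - 2)
kurt <- ((n + 1) * g2 + 6) * (n - 1) / ((n - 2) * (n - 3))

numeric_summary <- round(
  c(mean = m, sd = s, se = se,
    p0 = pcts[1], p25 = pcts[2], p50 = pcts[3], p75 = pcts[4], p100 = pcts[5],
    skew = skew, kurt = kurt),
  digits = 3
)
```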

Author

Craig P. Hutton, craig.hutton@gov.bc.ca

Examples


describe_all(mtcars)
#>     variable cases  n na p_na    mean      sd     se     p0     p25     p50
#>  1:      mpg    32 32  0    0  20.091   6.027  1.065 10.400  15.425  19.200
#>  2:      cyl    32 32  0    0   6.188   1.786  0.316  4.000   4.000   6.000
#>  3:     disp    32 32  0    0 230.722 123.939 21.909 71.100 120.825 196.300
#>  4:       hp    32 32  0    0 146.688  68.563 12.120 52.000  96.500 123.000
#>  5:     drat    32 32  0    0   3.597   0.535  0.095  2.760   3.080   3.695
#>  6:       wt    32 32  0    0   3.217   0.978  0.173  1.513   2.581   3.325
#>  7:     qsec    32 32  0    0  17.849   1.787  0.316 14.500  16.892  17.710
#>  8:       vs    32 32  0    0   0.438   0.504  0.089  0.000   0.000   0.000
#>  9:       am    32 32  0    0   0.406   0.499  0.088  0.000   0.000   0.000
#> 10:     gear    32 32  0    0   3.688   0.738  0.130  3.000   3.000   4.000
#> 11:     carb    32 32  0    0   2.812   1.615  0.286  1.000   2.000   2.000
#>        p75    p100   skew   kurt
#>  1:  22.80  33.900  0.672 -0.022
#>  2:   8.00   8.000 -0.192 -1.763
#>  3: 326.00 472.000  0.420 -1.068
#>  4: 180.00 335.000  0.799  0.275
#>  5:   3.92   4.930  0.293 -0.450
#>  6:   3.61   5.424  0.466  0.417
#>  7:  18.90  22.900  0.406  0.865
#>  8:   1.00   1.000  0.265 -2.063
#>  9:   1.00   1.000  0.401 -1.967
#> 10:   4.00   5.000  0.582 -0.895
#> 11:   4.00   8.000  1.157  2.020

if (FALSE) {
describe_all(pdata) #all summary types in a list

#numeric summary only
describe_all(pdata, high_low, output = "dt", class = "n")

#numeric and logical summaries only
describe_all(pdata, high_low, output = "dt", class = c("n", "l"))
}