Obtain a useful array of common summary statistics for a vector/variable, with output customized to the class of the variable. Uses a combination of tidyverse packages and data.table to provide a user-friendly, pipe-friendly interface while leveraging the performance of data.table. The ... argument also makes it easy to obtain summaries split by grouping variables. While similar functions exist in other packages (e.g. describeBy or skim), this version provides some of the useful added outputs of the psych package (e.g. se, skew, and kurtosis for numeric variables) while offering slightly more concise syntax than skim (e.g. no preceding group_by operation is needed for group-wise calculations) and still achieving processing times comparable to the alternatives. To obtain summaries for all variables in a data frame, use describe_all instead.

describe(
  data,
  y = NULL,
  ...,
  digits = 3,
  type = 2,
  na.rm = TRUE,
  sep = "_",
  output = c("tibble", "dt")
)

Arguments

data

Either a vector, or a data frame or tibble containing the vector ("y") to be summarized and any grouping variables.

y

If the data object is a data.frame, this is the variable for which you wish to obtain a descriptive summary. You can use either the quoted or unquoted name of the variable, e.g. "y_var" or y_var.

...

If the data object is a data.frame, this special argument accepts any number of unquoted grouping variable names (also present in the data source) to use for subsetting, separated by commas, e.g. group_var1, group_var2. Also accepts a character vector of column names or index numbers, e.g. c("group_var1", "group_var2") or c(1, 2), but not a mixture of formats in the same call. If no grouping variables are specified, a single summary is returned for all rows of y.

digits

This determines the number of digits used for rounding of numeric outputs.

type

For numeric and integer vectors, this determines the type of skewness and kurtosis calculations to perform. See skewness or skew and kurtosis or kurtosi for details.

na.rm

This determines whether missing values (NAs) should be removed before attempting to calculate summary statistics.

sep

A character string to use to separate unique values from their counts ("_" by default). Only applicable to factors and character vectors.

output

Output type for each class of variables: "dt" for data.table or "tibble" for tibble.
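The calls below are a usage sketch of how the arguments described above combine, not output-verified examples. They assume the elucidate package, which supplies describe() and the pdata example data used in the Examples section, is installed, and that pdata contains the columns y1, high_low, and g shown there. The native pipe requires R 4.1 or later.

```r
# Usage sketch only; skipped if the elucidate package is not installed.
if (requireNamespace("elucidate", quietly = TRUE)) {
  library(elucidate)

  # vector input: no y or grouping variables are needed
  print(describe(pdata$y1))

  # grouping variables given as unquoted names ...
  print(describe(pdata, y1, high_low, g))

  # ... or as a character vector of column names (formats are not mixed)
  print(describe(pdata, y1, c("high_low", "g")))

  # piped call, rounding to 1 digit and returning a data.table
  print(pdata |> describe(y1, high_low, digits = 1, output = "dt"))
}
```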

Value

The output varies as a function of the class of input data/y, referred to as "y" below.

For all input variables, the following are returned (part 1):

cases

the total number of cases

n

number of complete cases

na

the number of missing values

p_na

the proportion of total cases with missing values
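As a rough illustration of the four measures above, the following base R sketch computes them for a small made-up vector. It is not the package's implementation; the vector x and the name part1 are hypothetical and used only for this example.

```r
# Base R sketch of the "part 1" measures; x is made-up illustration data.
x <- c(3, NA, 7, 1, NA, 8, 2, 10)

cases <- length(x)              # total number of cases
n     <- sum(!is.na(x))         # number of complete (non-missing) cases
na    <- sum(is.na(x))          # number of missing values
p_na  <- round(na / cases, 3)   # proportion of cases that are missing

part1 <- data.frame(cases = cases, n = n, na = na, p_na = p_na)
print(part1)
```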

In addition to part 1, these measures are provided for dates:

n_unique

the total number of unique values or levels of y. For dates this tells you how many time points there are

start

the earliest or minimum date in y

end

the latest or maximum date in y

In addition to part 1, these measures are provided for factors:

n_unique

the total number of unique values or levels of y

ordered

a logical indicating whether or not y is ordinal

counts_tb

the counts of the top and bottom unique values of y in order of decreasing frequency formatted as "value_count". If there are more than 4 unique values of y, only the top 2 and bottom 2 unique values are shown separated by "...". To get counts for all unique values use counts instead.

In addition to part 1, these measures are provided for character/string vectors:

n_unique

the total number of unique values or levels of y

min_chars

the minimum number of characters in the values of y

max_chars

the maximum number of characters in the values of y

counts_tb

the counts of the top and bottom unique values of y in order of decreasing frequency formatted as "value_count". If there are more than 4 unique values of y, only the top 2 and bottom 2 unique values are shown separated by "...". To get counts for all unique values use counts instead.
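The counts_tb format described for factors and character vectors can be approximated in base R as follows. This is a minimal sketch under the assumption that values are joined as "value_count" with sep and listed in order of decreasing frequency; it is not the package's own code, and the vector y below is made-up data with distinct counts so that the ordering is unambiguous.

```r
# Base R sketch of a counts_tb-style string; y is made-up illustration data.
y   <- rep(c("a", "b", "c", "d", "e"), times = 5:1)
sep <- "_"

tab    <- sort(table(y), decreasing = TRUE)            # most frequent first
labels <- paste(names(tab), as.integer(tab), sep = sep)

# Show only the top 2 and bottom 2 values when there are more than 4.
counts_tb <- if (length(labels) > 4) {
  paste(c(head(labels, 2), "...", tail(labels, 2)), collapse = ", ")
} else {
  paste(labels, collapse = ", ")
}

n_unique  <- length(tab)     # number of unique values
min_chars <- min(nchar(y))   # shortest value, in characters
max_chars <- max(nchar(y))   # longest value, in characters

print(counts_tb)
```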

In addition to part 1, these measures are provided for logical vectors:

n_TRUE

the total number of y values that are TRUE

n_FALSE

the total number of y values that are FALSE

p_TRUE

the proportion of y values that are TRUE

In addition to part 1, these measures are provided for numeric variables:

mean

the mean of y

sd

the standard deviation of y

se

the standard error of the mean of y

p0

the 0th percentile (the minimum) of y

p25

the 25th percentile of y

p50

the 50th percentile (the median) of y

p75

the 75th percentile of y

p100

the 100th percentile (the maximum) of y

skew

the skewness of the distribution of y

kurt

the kurtosis of the distribution of y
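For readers who want to see how these numeric measures relate to one another, here is a base R sketch. It assumes the default quantile() method for the percentiles and the "type 2" skewness and excess kurtosis formulas from Joanes and Gill (1998); whether describe() computes each value in exactly this way is an assumption, and the vector x is made-up data.

```r
# Base R sketch of the numeric measures; not the package's implementation.
x <- c(2, 4, 4, 5, 7, 9, NA)
y <- x[!is.na(x)]               # mirrors na.rm = TRUE
n <- length(y)

m  <- mean(y)
s  <- sd(y)
se <- s / sqrt(n)               # standard error of the mean
q  <- quantile(y, probs = c(0, 0.25, 0.5, 0.75, 1), names = FALSE)

# Central moments, then type 2 skewness and excess kurtosis
# (Joanes & Gill, 1998); requires n > 3.
m2 <- mean((y - m)^2)
m3 <- mean((y - m)^3)
m4 <- mean((y - m)^4)
g1 <- m3 / m2^1.5
g2 <- m4 / m2^2 - 3
skew <- g1 * sqrt(n * (n - 1)) / (n - 2)
kurt <- ((n + 1) * g2 + 6) * (n - 1) / ((n - 2) * (n - 3))

numeric_summary <- round(
  c(mean = m, sd = s, se = se,
    p0 = q[1], p25 = q[2], p50 = q[3], p75 = q[4], p100 = q[5],
    skew = skew, kurt = kurt),
  digits = 3
)
print(numeric_summary)
```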

References

Altman, D. G., & Bland, J. M. (2005). Standard deviations and standard errors. BMJ, 331(7521), 903.

Bulmer, M. G. (1979). Principles of statistics. Courier Corporation.

Joanes, D. N., & Gill, C. A. (1998). Comparing measures of sample skewness and kurtosis. The Statistician, 47, 183-189.

Author

Craig P. Hutton, craig.hutton@gov.bc.ca

Examples


describe(data = pdata, y = y1) #no grouping variables, numeric input class
#> # A tibble: 1 × 14
#>   cases     n    na  p_na  mean    sd    se    p0   p25   p50   p75  p100  skew
#>   <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 12000 12000     0     0  154.  42.7  0.39  69.2  121.  145.  181.  289. 0.739
#> # … with 1 more variable: kurt <dbl>
describe(pdata, y1, high_low) #one grouping variable, numeric input class
#> # A tibble: 2 × 15
#>   high_low cases     n    na  p_na  mean    sd    se    p0   p25   p50   p75
#>   <chr>    <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 high      6045  6045     0     0  154.  42.9 0.552  70.7  121.  145.  182.
#> 2 low       5955  5955     0     0  154.  42.5 0.551  69.2  121.  145.  180.
#> # … with 3 more variables: p100 <dbl>, skew <dbl>, kurt <dbl>
describe(pdata, g) #factor input class
#> # A tibble: 1 × 7
#>   cases     n    na  p_na n_unique ordered counts_tb                          
#>   <int> <int> <int> <dbl>    <int> <lgl>   <chr>                              
#> 1 12000 12000     0     0        5 FALSE   a_2592, b_2460, ..., e_2352, c_2220
describe(pdata, even) #logical input class
#> # A tibble: 1 × 7
#>   cases     n    na  p_na n_TRUE n_FALSE p_TRUE
#>   <int> <int> <int> <dbl>  <dbl>   <dbl>  <dbl>
#> 1 12000 12000     0     0   6000    6000    0.5