Check the number of duplicated rows in a data frame.

Checks a data frame for duplicated rows based on the variables specified via ..., or on all columns if none are specified. dupes is a convenience shortcut for copies with the "filter" argument set to "dupes" and the "sort_by_copies" argument set to TRUE by default. For greater flexibility in checking row copy numbers or filtering for distinct rows, use copies instead. dupes behaves similarly to get_dupes() but is substantially faster because it uses data.table as a backend.
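As a rough illustration of the shortcut relationship described above, the two calls below are expected to select the same duplicated rows. This is only a sketch: it assumes that dupes() and copies() are exported by a package named elucidate (the package name is not stated on this page), and the toy data frame is hypothetical. The calls are skipped if that package is not installed.

```r
# Hypothetical toy data: "a" appears twice in g, "b" once.
toy <- data.frame(g = c("a", "a", "b"), v = c(1, 2, 3))

# Assumption: dupes() and copies() come from the elucidate package.
if (requireNamespace("elucidate", quietly = TRUE)) {
  # Shortcut form.
  via_dupes <- elucidate::dupes(toy, g)

  # Longer form, with the filter and sorting spelled out explicitly.
  via_copies <- elucidate::copies(toy, g, filter = "dupes", sort_by_copies = TRUE)
}
```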

dupes(
  data,
  ...,
  keep_all_cols = TRUE,
  sort_by_copies = TRUE,
  order = c("d", "a", "i"),
  na_last = FALSE,
  output = c("same", "tibble", "dt", "data.frame")
)

Arguments

data: a data frame, tibble, or data.table.
...: This special argument accepts any number of unquoted column names (also present in the data source) to use when searching for duplicates, e.g. x, y, z. Also accepts a character vector of column names or index numbers, e.g. c("x", "y", "z") or c(1, 2, 3), but not a mixture of formats in the same call. If no column names are specified, all columns will be used.
keep_all_cols: If column names are specified using ..., setting this to FALSE drops the unspecified columns from the output, similar to the .keep_all argument of `dplyr::distinct()`. Default is TRUE, which keeps all columns.
sort_by_copies: If TRUE (the default), sorts the results by the number of copies, in the order specified by the order argument.
order: If sort_by_copies is set to TRUE, this controls whether the results should be sorted in order of descending/decreasing = "d" (the default) or ascending/increasing = "a" or "i" copy numbers.
na_last: Should rows of the specified columns with missing values be listed below non-missing values (TRUE/FALSE)? Default is FALSE.
output: "tibble" for tibble, "dt" for data.table, or "data.frame" for a data frame. "same", the default option, returns the same format as the input data.
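The argument descriptions above can be sketched with a few calls on a small data frame. The toy data and the result names are hypothetical, and the elucidate package name is an assumption (it is not stated on this page), so the calls are guarded and only run if that package is available.

```r
# Hypothetical toy data: (1, "a") occurs twice and (2, "b") three times.
toy <- data.frame(
  x = c(1, 1, 2, 2, 2),
  y = c("a", "a", "b", "b", "b"),
  z = c(10, 20, 30, 40, 50)
)

# Assumption: dupes() is exported by the elucidate package.
if (requireNamespace("elucidate", quietly = TRUE)) {
  # Three ways of naming the columns to check; do not mix them in one call.
  by_name  <- elucidate::dupes(toy, x, y)          # unquoted column names
  by_chr   <- elucidate::dupes(toy, c("x", "y"))   # character vector
  by_index <- elucidate::dupes(toy, c(1, 2))       # column index numbers

  # Drop the unspecified column z, sort by ascending copy number,
  # and request a data.table regardless of the input class.
  compact <- elucidate::dupes(
    toy, x, y,
    keep_all_cols = FALSE,
    order = "a",
    output = "dt"
  )
}
```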

Value

A subset of the input data frame consisting of duplicated rows that were detected based on specified variables used to condition the search. A message will also be printed to the console indicating whether or not duplicates were detected. An n_copies column is appended specifying the total number of copies of each row that were detected.
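To make the meaning of the n_copies column concrete, here is a minimal base R sketch of the same idea: count the rows sharing each key combination, keep those with more than one copy, and sort by descending copy number. It is not the package's implementation (which, per the description, uses data.table), and the toy data are hypothetical.

```r
# Hypothetical toy data: rows 1 and 3 share (x, y); rows 4 to 6 share (x, y).
df <- data.frame(
  x = c(1, 2, 1, 3, 3, 3),
  y = c("a", "b", "a", "c", "c", "c"),
  z = 1:6
)

# Count how many rows share each (x, y) combination.
# ave() applies length() within each group and returns one value per row.
df$n_copies <- ave(seq_len(nrow(df)), df$x, df$y, FUN = length)

# Keep only rows whose key combination occurs more than once, then sort by
# descending copy number (analogous to sort_by_copies = TRUE, order = "d").
dups <- df[df$n_copies > 1, ]
dups <- dups[order(-dups$n_copies), ]
dups
```

Row 2 is dropped because its (x, y) combination is unique; the remaining rows carry the total count for their combination, so every member of a duplicated group is returned, not just the second and later copies.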

Author

Craig P. Hutton, craig.hutton@gov.bc.ca

Examples


# check for duplicates based on one variable, "g" in this case
dupes(pdata, g)
#> Duplicated rows detected! 12000 of 12000 rows in the input data have multiple copies.
#> # A tibble: 12,000 × 11
#>       id d          g     high_low even     y1    y2    x1    x2    x3 n_copies
#>    <int> <date>     <fct> <chr>    <lgl> <dbl> <dbl> <int> <int> <int>    <int>
#>  1     5 2008-01-01 a     high     FALSE  99.7 113.     96   196   284     2592
#>  2     6 2008-01-01 a     high     TRUE  102.  114.     19   163   206     2592
#>  3    11 2008-01-01 a     high     FALSE  93.8 102.     56   142   285     2592
#>  4    12 2008-01-01 a     low      TRUE   96.5  92.4   100   111   277     2592
#>  5    13 2008-01-01 a     low      FALSE  86.6  86.0    69   119   228     2592
#>  6    14 2008-01-01 a     high     TRUE   86.5 105.     44   182   290     2592
#>  7    15 2008-01-01 a     high     FALSE  99.6 101.     91   124   294     2592
#>  8    19 2008-01-01 a     high     FALSE  99.6 108.      3   139   275     2592
#>  9    21 2008-01-01 a     high     FALSE  95.6 103.     33   180   231     2592
#> 10    22 2008-01-01 a     high     TRUE   86.8 105.     92   117   251     2592
#> # … with 11,990 more rows

if (FALSE) {
# check for duplicates based on two variables
dupes(pdata, high_low, g)

# check based on all variables, i.e. fully duplicated rows
dupes(pdata)
}
