Checks a data frame for duplicated rows based on the variables specified via ..., or on all columns if none are specified. dupes is a convenience shortcut for copies with the "filter" argument set to "dupes" and the "sort_by_copies" argument set to TRUE by default. For greater flexibility in checking row copy numbers or filtering for distinct rows, use copies instead. dupes behaves similarly to get_dupes() but is substantially faster because it uses data.table as a backend.
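
The relationship between dupes and copies described above can be sketched as follows. This is an illustration only: df is a hypothetical toy data frame, the code assumes the elucidate package is installed, and the copies call uses only the "filter" and "sort_by_copies" arguments named in the description.

```r
# Hypothetical toy data: (1, "a") appears twice, (2, "b") three times.
df <- data.frame(
  x = c(1, 1, 2, 2, 2, 3),
  y = c("a", "a", "b", "b", "b", "c")
)

# Only runs if elucidate is available (assumption: it is installed).
if (requireNamespace("elucidate", quietly = TRUE)) {
  # Shortcut form
  via_dupes <- elucidate::dupes(df, x, y)

  # Longer form via copies, using the arguments named in the description
  via_copies <- elucidate::copies(df, x, y,
                                  filter = "dupes",
                                  sort_by_copies = TRUE)

  print(via_dupes)
  print(via_copies)
}
```

The two calls are expected to return the same duplicated rows; printing both makes it easy to compare them by eye.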

dupes(
  data,
  ...,
  keep_all_cols = TRUE,
  sort_by_copies = TRUE,
  order = c("d", "a", "i"),
  na_last = FALSE,
  output = c("same", "tibble", "dt", "data.frame")
)

Arguments

data

a data frame, tibble, or data.table.

...

Any number of unquoted column names (which must be present in data) to use when searching for duplicates, e.g. x, y, z. Also accepts a character vector of column names or a vector of column index numbers, e.g. c("x", "y", "z") or c(1, 2, 3), but not a mixture of formats in the same call. If no columns are specified, all columns are used.
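
The accepted selection formats can be sketched as follows. df is a hypothetical toy data frame, and the code assumes the elucidate package is installed. Each call uses a single format, since formats cannot be mixed.

```r
# Hypothetical toy data: rows 1 and 2 share the same x and y values.
df <- data.frame(
  x = c(1, 1, 2),
  y = c("a", "a", "b"),
  z = 1:3
)

# Only runs if elucidate is available (assumption: it is installed).
if (requireNamespace("elucidate", quietly = TRUE)) {
  # Unquoted column names
  print(elucidate::dupes(df, x, y))

  # Character vector of column names
  print(elucidate::dupes(df, c("x", "y")))

  # Column index numbers (x is column 1, y is column 2)
  print(elucidate::dupes(df, c(1, 2)))

  # No columns specified: all columns are used
  print(elucidate::dupes(df))
}
```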

keep_all_cols

If column names are specified using ..., set this to FALSE to drop the unspecified columns from the output, similar to the .keep_all argument of dplyr::distinct(). Default is TRUE, which keeps all columns.

sort_by_copies

If TRUE (the default), sorts the results by the number of copies, in the order specified by the order argument.

order

If sort_by_copies is TRUE, this controls whether the results are sorted by copy number in descending/decreasing order ("d", the default) or ascending/increasing order ("a" or "i").

na_last

Should rows with missing values in the specified columns be listed below rows with non-missing values (TRUE/FALSE)? Default is FALSE.

output

"tibble" for tibble, "dt" for data.table, or "data.frame" for a data frame. "same", the default option, returns the same format as the input data.

Value

A subset of the input data frame consisting of duplicated rows that were detected based on specified variables used to condition the search. A message will also be printed to the console indicating whether or not duplicates were detected. An n_copies column is appended specifying the total number of copies of each row that were detected.
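
To make the n_copies column concrete, here is a minimal base R sketch of the same idea on a hypothetical toy data frame. It is not the package's implementation (which uses data.table), it does not handle missing values in the key columns, and the names df, key_cols, and dup_rows are illustrative only.

```r
# Hypothetical toy data: rows 1, 2, and 4 share the same (g, high_low) values.
df <- data.frame(
  id = 1:5,
  g = c("a", "a", "b", "a", "c"),
  high_low = c("high", "high", "low", "high", "low"),
  stringsAsFactors = FALSE
)

key_cols <- c("g", "high_low")

# Count how many rows share each combination of the key columns.
key <- interaction(df[key_cols], drop = TRUE)
df$n_copies <- as.integer(ave(seq_len(nrow(df)), key, FUN = length))

# Keep only rows whose key combination occurs more than once,
# then sort by copy number in descending order.
dup_rows <- df[df$n_copies > 1, , drop = FALSE]
dup_rows <- dup_rows[order(dup_rows$n_copies, decreasing = TRUE), , drop = FALSE]

message(nrow(dup_rows), " of ", nrow(df), " rows have multiple copies.")
dup_rows
```

Here n_copies is the total number of rows sharing a key combination, so every row in a duplicated group is returned, including the first occurrence.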

Author

Craig P. Hutton, craig.hutton@gov.bc.ca

Examples


# check for duplicates based on one variable, "g" in this case
dupes(pdata, g)
#> Duplicated rows detected! 12000 of 12000 rows in the input data have multiple copies.
#> # A tibble: 12,000 × 11
#>       id d          g     high_low even     y1    y2    x1    x2    x3 n_copies
#>    <int> <date>     <fct> <chr>    <lgl> <dbl> <dbl> <int> <int> <int>    <int>
#>  1     5 2008-01-01 a     high     FALSE  99.7 113.     96   196   284     2592
#>  2     6 2008-01-01 a     high     TRUE  102.  114.     19   163   206     2592
#>  3    11 2008-01-01 a     high     FALSE  93.8 102.     56   142   285     2592
#>  4    12 2008-01-01 a     low      TRUE   96.5  92.4   100   111   277     2592
#>  5    13 2008-01-01 a     low      FALSE  86.6  86.0    69   119   228     2592
#>  6    14 2008-01-01 a     high     TRUE   86.5 105.     44   182   290     2592
#>  7    15 2008-01-01 a     high     FALSE  99.6 101.     91   124   294     2592
#>  8    19 2008-01-01 a     high     FALSE  99.6 108.      3   139   275     2592
#>  9    21 2008-01-01 a     high     FALSE  95.6 103.     33   180   231     2592
#> 10    22 2008-01-01 a     high     TRUE   86.8 105.     92   117   251     2592
#> # … with 11,990 more rows

if (FALSE) {
# check for duplicates based on two variables
dupes(pdata, high_low, g)

# check based on all variables, i.e. fully duplicated rows
dupes(pdata)
}