Checks a data frame for duplicated rows based on the variables
specified via ..., or on all columns if none are specified.
dupes is a convenience shortcut for copies
with the "filter" argument set to "dupes" and the "sort_by_copies" argument
set to TRUE by default. For greater flexibility in checking row copy
numbers or filtering for distinct rows, use copies instead.
dupes behaves similarly to get_dupes() but is
substantially faster due to the use of data.table as a backend.
A data frame, tibble, or data.table.
This special argument accepts any number of unquoted column names
(also present in the data source) to use when searching for duplicates,
e.g. x, y, z. Also accepts a character vector of column names or index
numbers, e.g. c("x", "y", "z") or c(1, 2, 3), but not a mixture of formats
in the same call. If no column names are specified, all columns will be
used.
If column names are specified using ..., this allows
you to drop unspecified columns, similarly to the .keep_all argument for
`dplyr::distinct()`.
If TRUE (the default), sorts the results by the number
of copies, in the order specified by the order argument.
If sort_by_copies is set to TRUE, this controls whether the results are sorted by descending/decreasing copy numbers ("d", the default) or by ascending/increasing copy numbers ("a" or "i").
Should rows with missing values in the specified columns be listed below rows with non-missing values (TRUE/FALSE)? Default is FALSE.
Format of the returned data: "tibble" for a tibble, "dt" for a data.table, or "data.frame" for a data frame. The default, "same", returns the same format as the input data.
A subset of the input data frame consisting of duplicated rows that were
detected based on specified variables used to condition the search. A
message will also be printed to the console indicating whether or not
duplicates were detected. An n_copies column is appended specifying the
total number of copies of each row that were detected.
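To make the meaning of the n_copies column concrete, here is a minimal base R sketch of the same idea. It is not how dupes is implemented (the function uses data.table as a backend); the data frame toy and its columns g and x are hypothetical names made up for this illustration.

```r
# Toy data; `toy`, `g`, and `x` are hypothetical names used only for illustration.
toy <- data.frame(
  g = c("a", "a", "b", "c", "c", "c"),
  x = c(1, 1, 2, 3, 3, 4)
)

# Count how many rows share each value of g (base R only, not the package's method).
toy$n_copies <- ave(seq_len(nrow(toy)), toy$g, FUN = length)

# Keep only rows whose g value occurs more than once,
# then sort by descending copy number.
toy_dupes <- toy[toy$n_copies > 1, ]
toy_dupes <- toy_dupes[order(-toy_dupes$n_copies), ]
toy_dupes
```

The single row with g equal to "b" is dropped because it has only one copy; the remaining rows are listed with the three "c" rows first, followed by the two "a" rows.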
# check for duplicates based on one variable, "g" in this case
dupes(pdata, g)
#> Duplicated rows detected! 12000 of 12000 rows in the input data have multiple copies.
#> # A tibble: 12,000 × 11
#> id d g high_low even y1 y2 x1 x2 x3 n_copies
#> <int> <date> <fct> <chr> <lgl> <dbl> <dbl> <int> <int> <int> <int>
#> 1 5 2008-01-01 a high FALSE 99.7 113. 96 196 284 2592
#> 2 6 2008-01-01 a high TRUE 102. 114. 19 163 206 2592
#> 3 11 2008-01-01 a high FALSE 93.8 102. 56 142 285 2592
#> 4 12 2008-01-01 a low TRUE 96.5 92.4 100 111 277 2592
#> 5 13 2008-01-01 a low FALSE 86.6 86.0 69 119 228 2592
#> 6 14 2008-01-01 a high TRUE 86.5 105. 44 182 290 2592
#> 7 15 2008-01-01 a high FALSE 99.6 101. 91 124 294 2592
#> 8 19 2008-01-01 a high FALSE 99.6 108. 3 139 275 2592
#> 9 21 2008-01-01 a high FALSE 95.6 103. 33 180 231 2592
#> 10 22 2008-01-01 a high TRUE 86.8 105. 92 117 251 2592
#> # … with 11,990 more rows
if (FALSE) {
# check based on two variables, "high_low" and "g"
dupes(pdata, high_low, g)
# check based on all variables, i.e. fully duplicated rows
dupes(pdata)
}