Checks a data frame for duplicated rows based on the variables specified
via ..., or on all columns if none are specified. dupes is a convenience
shortcut for copies with the "filter" argument set to "dupes" and the
"sort_by_copies" argument set to TRUE by default. For greater flexibility
in checking row copy numbers or filtering for distinct rows, use copies
instead.
dupes behaves similarly to get_dupes but is substantially faster due to
the use of data.table as a backend.
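The relationship between dupes and copies described above can be sketched as follows. This is a minimal illustration, assuming both functions come from the elucidate package (the text above does not name the package) and that it is installed; `obs` is a hypothetical toy data frame made up for this sketch.

```r
# Hypothetical toy data: "a" appears twice, "b" once, "c" three times.
obs <- data.frame(g = c("a", "a", "b", "c", "c", "c"), v = 1:6)

# Assumption: dupes() and copies() are exported by the elucidate package.
if (requireNamespace("elucidate", quietly = TRUE)) {
  # The shortcut form.
  via_dupes <- elucidate::dupes(obs, g)

  # The longer form that the description says it is equivalent to.
  via_copies <- elucidate::copies(obs, g, filter = "dupes",
                                  sort_by_copies = TRUE)

  # Compare the two results, ignoring attributes.
  print(all.equal(via_dupes, via_copies, check.attributes = FALSE))
}
```

The `requireNamespace()` guard only keeps the sketch from failing when the package is unavailable; it is not part of the documented usage.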
A data frame, tibble, or data.table.
This special argument accepts any number of unquoted column names
(also present in the data source) to use when searching for duplicates,
e.g. x, y, z. Also accepts a character vector of column names or index
numbers, e.g. c("x", "y", "z") or c(1, 2, 3), but not a mixture of formats
in the same call. If no column names are specified, all columns will be
used.
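The three accepted ways of naming columns can be sketched like this. It is an illustration under stated assumptions: that dupes comes from the elucidate package (not named in the text above) and that the package is installed. `toy` is a hypothetical data frame invented for the example.

```r
# Hypothetical toy data (not from the package documentation).
toy <- data.frame(
  x = c(1, 1, 2, 3, 3, 3),
  y = c("a", "a", "b", "c", "c", "d"),
  z = 1:6
)

# Assumption: dupes() is exported by the elucidate package.
if (requireNamespace("elucidate", quietly = TRUE)) {
  # Unquoted column names.
  by_name <- elucidate::dupes(toy, x, y)

  # A character vector of column names.
  by_string <- elucidate::dupes(toy, c("x", "y"))

  # A vector of column index numbers.
  by_index <- elucidate::dupes(toy, c(1, 2))

  # No columns specified: all columns are used for the check.
  all_cols <- elucidate::dupes(toy)
}
```

Each call uses a single format, since the description above says formats cannot be mixed within the same call.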
If column names are specified using ..., this allows you to drop
unspecified columns, similarly to the .keep_all argument for
dplyr::distinct().
If TRUE (the default), sorts the results by the number of copies, in the
order specified by the order argument.
If sort_by_copies is set to TRUE, this controls whether the results are sorted by copy number in descending/decreasing order ("d", the default) or ascending/increasing order ("a" or "i").
Should rows with missing values in the specified columns be listed below rows with non-missing values (TRUE/FALSE)? Default is FALSE.
"tibble" for tibble, "dt" for data.table, or "data.frame" for a data frame. "same", the default option, returns the same format as the input data.
A subset of the input data frame consisting of duplicated rows that were
detected based on specified variables used to condition the search. A
message will also be printed to the console indicating whether or not
duplicates were detected. An n_copies column is appended specifying the
total number of copies of each row that were detected.
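To make the meaning of n_copies concrete, here is a base R sketch of the same idea on a hypothetical toy data frame. It is not the package's data.table implementation, only an illustration of counting copies per key, keeping rows with more than one copy, and sorting by descending copy number.

```r
# Hypothetical toy data: g has three copies of "a", two of "b", one of "c".
toy <- data.frame(id = 1:6, g = c("a", "b", "a", "c", "b", "a"))

# Count how many rows share each value of g (the role played by n_copies).
toy$n_copies <- ave(seq_len(nrow(toy)), toy$g, FUN = length)

# Keep only rows whose key occurs more than once.
toy_dupes <- toy[toy$n_copies > 1, ]

# Sort so that the most-copied groups come first, then by key.
toy_dupes <- toy_dupes[order(-toy_dupes$n_copies, toy_dupes$g), ]

toy_dupes
```

The row with g equal to "c" is dropped because it has only one copy; every remaining row carries the total count for its group.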
# check for duplicates based on one variable, "g" in this case
dupes(pdata, g)
#> Duplicated rows detected! 12000 of 12000 rows in the input data have multiple copies.
#> # A tibble: 12,000 × 11
#> id d g high_low even y1 y2 x1 x2 x3 n_copies
#> <int> <date> <fct> <chr> <lgl> <dbl> <dbl> <int> <int> <int> <int>
#> 1 5 2008-01-01 a high FALSE 99.7 113. 96 196 284 2592
#> 2 6 2008-01-01 a high TRUE 102. 114. 19 163 206 2592
#> 3 11 2008-01-01 a high FALSE 93.8 102. 56 142 285 2592
#> 4 12 2008-01-01 a low TRUE 96.5 92.4 100 111 277 2592
#> 5 13 2008-01-01 a low FALSE 86.6 86.0 69 119 228 2592
#> 6 14 2008-01-01 a high TRUE 86.5 105. 44 182 290 2592
#> 7 15 2008-01-01 a high FALSE 99.6 101. 91 124 294 2592
#> 8 19 2008-01-01 a high FALSE 99.6 108. 3 139 275 2592
#> 9 21 2008-01-01 a high FALSE 95.6 103. 33 180 231 2592
#> 10 22 2008-01-01 a high TRUE 86.8 105. 92 117 251 2592
#> # … with 11,990 more rows
if (FALSE) {
  # check based on two variables
  dupes(pdata, high_low, g)
  # check based on all variables, i.e. fully duplicated rows
  dupes(pdata)
}