Checks a data frame for duplicated rows based on the variables specified
via `...`, or on all columns if none are specified. The output can also be
filtered to retain all records with copy-number information, a subset of
distinct records, or a subset of duplicated records. This makes `copies()`
similar to both `get_dupes()` and `distinct()`, while offering a larger
array of output options and competitive performance by using data.table as
a backend. `dupes()` is also available as a convenience shortcut for
`copies(filter = "dupes", sort_by_copies = TRUE)`.
A data frame, tibble, or data.table.
This special argument accepts any number of unquoted column names
(also present in the data source) to use when searching for duplicates,
e.g. `x, y, z`. Also accepts a character vector of column names or index
numbers, e.g. `c("x", "y", "z")` or `c(1, 2, 3)`, but not a mixture of
formats in the same call. If no column names are specified, all columns
will be used.
Shortcuts for filtering (retaining a subset of) the rows of the output
based on the number of copies detected. Options include:
- "all" = all rows that were present in the input (default).
- "dupes" = only rows that were found to be duplicated (mimics the
behaviour of `get_dupes()`).
- "unique" = only rows that appear as a single copy (not duplicated at all).
- "first" = keeps the 1st copy in cases where duplicates are detected
(mimics the behaviour of `distinct()` & `unique()`).
- "last" = keeps the last copy in cases where duplicates are detected.
Note: if "dupes" is selected, a message will be printed to the console
indicating whether or not duplicates were detected.
If column names are specified using `...`, this allows you to drop
unspecified columns, similar to the `.keep_all` argument of
`dplyr::distinct()`.
Only applicable to the "all" & "dupes" filtering options. If TRUE, sorts
the results by the number of copies, in the order specified by the `order`
argument. Default is FALSE to maximize performance.
Only applicable to the "all" & "dupes" filtering options. If `sort_by_copies` is set to TRUE, this controls whether the results are sorted by number of copies in descending/decreasing order ("d", the default) or ascending/increasing order ("a" or "i").
Should rows with missing values in the specified columns be listed below rows with non-missing values (TRUE/FALSE)? Default is FALSE.
"tibble" for tibble, "dt" for data.table, or "data.frame" for a data frame. "same", the default option, returns the same format as the input data.
If the `filter` argument is set to "all", returns a modified version of the
input data frame with two additional columns added to the end/right side:
- `copy_number` = the row copy number, which is included to allow
subsequent filtering based on the 1st or last copy detected.
- `n_copies` = the total number of copies detected.
If `filter` is set to "dupes", then only the `n_copies` column is appended
and only duplicated rows are returned. If any of the other `filter`
options are chosen, only the chosen subset of the rows & columns will be returned.
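To make the two appended columns concrete, the following base R sketch reproduces what `copy_number` and `n_copies` are described as containing. It uses a small hypothetical data frame `toy` that is not part of the package. It only illustrates the semantics and is not how `copies()` is implemented, since the function is documented as using data.table. The filter expressions assume that "first" and "last" also keep rows that occur only once, which is what the `distinct()`-like behaviour described above suggests.

```r
# Hypothetical toy data; duplication is checked on column `g` only.
toy <- data.frame(id = 1:6, g = c("a", "b", "a", "c", "a", "b"))

# copy_number: running count of each value of g in row order (1st copy, 2nd copy, ...).
toy$copy_number <- ave(seq_len(nrow(toy)), toy$g, FUN = seq_along)

# n_copies: total number of rows sharing the same value of g.
toy$n_copies <- ave(seq_len(nrow(toy)), toy$g, FUN = length)

# Subsets that correspond to the described filter options (assumed semantics).
dupes_rows  <- toy[toy$n_copies > 1, ]                 # like filter = "dupes"
unique_rows <- toy[toy$n_copies == 1, ]                # like filter = "unique"
first_rows  <- toy[toy$copy_number == 1, ]             # like filter = "first"
last_rows   <- toy[toy$copy_number == toy$n_copies, ]  # like filter = "last"

print(toy)
```

Keeping both columns in the "all" output means any of the other subsets can be derived afterwards with an ordinary row filter, without re-running the duplicate check.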
# check based on one variable & return all rows with copy indicators
copies(pdata, g, filter = "all") # the default
#> # A tibble: 12,000 × 12
#> id d g high_low even y1 y2 x1 x2 x3 copy_nu…¹
#> <int> <date> <fct> <chr> <lgl> <dbl> <dbl> <int> <int> <int> <int>
#> 1 1 2008-01-01 e high FALSE 106. 118. 59 116 248 1
#> 2 2 2008-01-01 c high TRUE 96.5 107. 5 101 238 1
#> 3 3 2008-01-01 d low FALSE 99.3 96.2 71 111 250 1
#> 4 4 2008-01-01 c high TRUE 109. 102. 60 130 287 2
#> 5 5 2008-01-01 a high FALSE 99.7 113. 96 196 284 1
#> 6 6 2008-01-01 a high TRUE 102. 114. 19 163 206 2
#> 7 7 2008-01-01 d low FALSE 91.0 87.9 77 133 201 2
#> 8 8 2008-01-01 b low TRUE 109. 98.7 74 191 249 1
#> 9 9 2008-01-01 e low FALSE 99.8 89.8 92 106 277 2
#> 10 10 2008-01-01 c low TRUE 122. 83.6 4 134 209 3
#> # … with 11,990 more rows, 1 more variable: n_copies <int>, and abbreviated
#> # variable name ¹copy_number
# check based on one variable & return duplicated rows only
copies(pdata, g, filter = "dupes")
#> Duplicated rows detected! 12000 of 12000 rows in the input data have multiple copies.
#> # A tibble: 12,000 × 11
#> id d g high_low even y1 y2 x1 x2 x3 n_copies
#> <int> <date> <fct> <chr> <lgl> <dbl> <dbl> <int> <int> <int> <int>
#> 1 1 2008-01-01 e high FALSE 106. 118. 59 116 248 2352
#> 2 2 2008-01-01 c high TRUE 96.5 107. 5 101 238 2220
#> 3 3 2008-01-01 d low FALSE 99.3 96.2 71 111 250 2376
#> 4 4 2008-01-01 c high TRUE 109. 102. 60 130 287 2220
#> 5 5 2008-01-01 a high FALSE 99.7 113. 96 196 284 2592
#> 6 6 2008-01-01 a high TRUE 102. 114. 19 163 206 2592
#> 7 7 2008-01-01 d low FALSE 91.0 87.9 77 133 201 2376
#> 8 8 2008-01-01 b low TRUE 109. 98.7 74 191 249 2460
#> 9 9 2008-01-01 e low FALSE 99.8 89.8 92 106 277 2352
#> 10 10 2008-01-01 c low TRUE 122. 83.6 4 134 209 2220
#> # … with 11,990 more rows
# check based on one variable & return distinct/unique rows only
copies(pdata, g, filter = "unique")
#> # A tibble: 0 × 10
#> # … with 10 variables: id <int>, d <date>, g <fct>, high_low <chr>, even <lgl>,
#> # y1 <dbl>, y2 <dbl>, x1 <int>, x2 <int>, x3 <int>
# check based on one variable & return the 1st detected copy for cases where
# more than one copy is detected (like `dplyr::distinct()` or `unique()`)
copies(pdata, g, filter = "first")
#> # A tibble: 5 × 10
#> id d g high_low even y1 y2 x1 x2 x3
#> <int> <date> <fct> <chr> <lgl> <dbl> <dbl> <int> <int> <int>
#> 1 1 2008-01-01 e high FALSE 106. 118. 59 116 248
#> 2 2 2008-01-01 c high TRUE 96.5 107. 5 101 238
#> 3 3 2008-01-01 d low FALSE 99.3 96.2 71 111 250
#> 4 5 2008-01-01 a high FALSE 99.7 113. 96 196 284
#> 5 8 2008-01-01 b low TRUE 109. 98.7 74 191 249
# check based on one variable & return the last detected copy for cases where
# more than one copy is detected (like `unique()` with `fromLast = TRUE`)
copies(pdata, g, filter = "last")
#> # A tibble: 5 × 10
#> id d g high_low even y1 y2 x1 x2 x3
#> <int> <date> <fct> <chr> <lgl> <dbl> <dbl> <int> <int> <int>
#> 1 992 2019-01-01 d low TRUE 245. 96.7 44 185 212
#> 2 993 2019-01-01 b high FALSE 195. 114. 71 129 239
#> 3 997 2019-01-01 a high FALSE 171. 103. 12 172 221
#> 4 998 2019-01-01 e high TRUE 154. 116. 35 115 254
#> 5 1000 2019-01-01 c low TRUE 280. 84.3 50 118 278
if (FALSE) {
  copies(pdata, high_low, g) # check based on 2 variables
  copies(pdata) # check based on all columns
}
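The comments above compare the "first" and "last" options to `unique()` with and without `fromLast = TRUE`. As a rough cross-check, the same selections can be sketched in base R with `duplicated()` on a hypothetical data frame `toy` (an assumption made for illustration, not package data). This is only a comparison of the described behaviour, not the package's own code.

```r
# Hypothetical toy data; duplication is checked on column `g` only.
toy <- data.frame(id = 1:6, g = c("a", "b", "a", "c", "a", "b"))

# First row seen for each value of g (what filter = "first" is described as keeping).
first_rows <- toy[!duplicated(toy$g), ]

# Last row seen for each value of g (what filter = "last" is described as keeping).
last_rows <- toy[!duplicated(toy$g, fromLast = TRUE), ]

# Rows whose value of g occurs more than once, scanning from both directions.
dupe_rows <- toy[duplicated(toy$g) | duplicated(toy$g, fromLast = TRUE), ]

print(first_rows)
print(last_rows)
print(dupe_rows)
```

Note that a single call to `duplicated()` flags only the second and later occurrences, so both scan directions are combined here to recover every row that belongs to a duplicated group.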