Checks a data frame for duplicated rows, based either on the variables
specified via ... or on all columns (if none are specified). The output can
also be filtered to retain all records with copy number information, a subset
of distinct records, or a subset of duplicated records. This makes copies()
similar to both get_dupes() & distinct(), while offering a larger array of
output options and competitive performance by using data.table as a
backend. dupes() is also available as a convenience shortcut
for copies(filter = "dupes", sort_by_copies = TRUE).
A data frame, tibble, or data.table.
This special argument accepts any number of unquoted column names
(also present in the data source) to use when searching for duplicates,
e.g. x, y, z. Also accepts a character vector of column names or index
numbers, e.g. c("x", "y", "z") or c(1, 2, 3), but not a mixture of formats
in the same call. If no column names are specified, all columns will be
used.
Shortcuts for filtering (retaining a subset of) the rows of the
output based on the number of copies detected. Options include: "all" =
all rows that were present in the input (default), "dupes" = only rows
that were found to be duplicated (mimics the behaviour of
get_dupes()), "unique" = only rows that appear as a
single copy (not duplicated at all), "first" = keeps the 1st copy in
cases where duplicates are detected (mimics the behaviour of
distinct() & unique()), and "last" = keeps
the last copy in cases where duplicates are detected. Note: if "dupes" is
selected a message will be printed to the console indicating whether or
not duplicates were detected.
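The filter options can be approximated in base R with `duplicated()`. The following is only a sketch of the semantics described above, not the function's data.table-based implementation; the toy data frame `df` is hypothetical, and it assumes that "first" and "last" also retain rows that occur only once, as `unique()` does.

```r
# Hypothetical toy data: key column g has one unique value ("c")
# and two duplicated values ("a" and "b").
df <- data.frame(id = 1:6, g = c("a", "b", "a", "c", "b", "a"))

# TRUE where the key already appeared earlier in the data.
dup_fwd <- duplicated(df$g)
# TRUE where the key appears again later in the data.
dup_rev <- duplicated(df$g, fromLast = TRUE)

dupes_rows  <- df[dup_fwd | dup_rev, ]     # roughly filter = "dupes"
unique_rows <- df[!(dup_fwd | dup_rev), ]  # roughly filter = "unique"
first_rows  <- df[!dup_fwd, ]              # roughly filter = "first"
last_rows   <- df[!dup_rev, ]              # roughly filter = "last"
```

Note that "unique" here means rows whose key occurs exactly once, which is a stricter subset than the one returned by "first" or "last".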
If column names are specified using ..., this allows
you to drop unspecified columns, similarly to the .keep_all argument for
`dplyr::distinct()`.
Only applicable to the "all" & "dupes" filtering
options. If TRUE, sorts the results by the number of copies, in order
specified by the order argument. Default is FALSE to maximize
performance.
Only applicable to the "all" & "dupes" filtering options. If sort_by_copies is set to TRUE, this controls whether the results should be sorted in order of descending/decreasing = "d" (the default) or ascending/increasing = "a" or "i" copy numbers.
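As a rough base R illustration of what sorting by copy number means, the sketch below orders a hypothetical toy data frame by the number of copies of each key value. The tie order shown comes from base `order()` and is an assumption; the function itself may order ties differently.

```r
# Hypothetical toy data; g is the column checked for duplicates.
df <- data.frame(id = 1:6, g = c("a", "b", "a", "c", "b", "a"))

# Total number of rows sharing each value of g.
n_copies <- ave(seq_len(nrow(df)), df$g, FUN = length)

desc_rows <- df[order(-n_copies), ]  # most-copied rows first, like order = "d"
asc_rows  <- df[order(n_copies), ]   # least-copied rows first, like order = "a" or "i"
```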
Should rows of the specified columns with missing values be listed below non-missing values (TRUE/FALSE)? Default is FALSE.
"tibble" for tibble, "dt" for data.table, or "data.frame" for a data frame. "same", the default option, returns the same format as the input data.
If filter argument is set to "all", returns a modified version of the
input data frame with two additional columns added to the end/right side:
- `copy_number` = the row copy number which is included to allow
subsequent filtering based on the 1st or last copy detected.
- `n_copies` = the total number of copies detected
If filter is set to "dupes", only the n_copies column is appended
and only duplicated rows are returned. If any of the other filter
options is chosen, only the chosen subset of rows & columns is returned.
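To make the two indicator columns concrete, here is a minimal base R sketch that derives equivalent values with `ave()`. It only illustrates what `copy_number` and `n_copies` represent; the data frame `df` is hypothetical and this is not how the function computes them internally.

```r
# Hypothetical toy data; g is the column checked for duplicates.
df <- data.frame(id = 1:6, g = c("a", "b", "a", "c", "b", "a"))

# Running count of each key value in row order (1st copy, 2nd copy, ...).
df$copy_number <- ave(seq_len(nrow(df)), df$g, FUN = seq_along)
# Total number of rows sharing each key value.
df$n_copies <- ave(seq_len(nrow(df)), df$g, FUN = length)

# Subsequent filtering on the indicators:
first_copies <- df[df$copy_number == 1, ]            # 1st copy of each key
last_copies  <- df[df$copy_number == df$n_copies, ]  # last copy of each key
```

With both columns available, the first or last copy can be selected after the fact, which is the use case the `copy_number` column is described as supporting.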
# check based on one variable & return all rows with copy indicators
copies(pdata, g, filter = "all") #the default
#> # A tibble: 12,000 × 12
#> id d g high_low even y1 y2 x1 x2 x3 copy_nu…¹
#> <int> <date> <fct> <chr> <lgl> <dbl> <dbl> <int> <int> <int> <int>
#> 1 1 2008-01-01 e high FALSE 106. 118. 59 116 248 1
#> 2 2 2008-01-01 c high TRUE 96.5 107. 5 101 238 1
#> 3 3 2008-01-01 d low FALSE 99.3 96.2 71 111 250 1
#> 4 4 2008-01-01 c high TRUE 109. 102. 60 130 287 2
#> 5 5 2008-01-01 a high FALSE 99.7 113. 96 196 284 1
#> 6 6 2008-01-01 a high TRUE 102. 114. 19 163 206 2
#> 7 7 2008-01-01 d low FALSE 91.0 87.9 77 133 201 2
#> 8 8 2008-01-01 b low TRUE 109. 98.7 74 191 249 1
#> 9 9 2008-01-01 e low FALSE 99.8 89.8 92 106 277 2
#> 10 10 2008-01-01 c low TRUE 122. 83.6 4 134 209 3
#> # … with 11,990 more rows, 1 more variable: n_copies <int>, and abbreviated
#> # variable name ¹copy_number
# check based on one variable & return duplicated rows only
copies(pdata, g, filter = "dupes")
#> Duplicated rows detected! 12000 of 12000 rows in the input data have multiple copies.
#> # A tibble: 12,000 × 11
#> id d g high_low even y1 y2 x1 x2 x3 n_copies
#> <int> <date> <fct> <chr> <lgl> <dbl> <dbl> <int> <int> <int> <int>
#> 1 1 2008-01-01 e high FALSE 106. 118. 59 116 248 2352
#> 2 2 2008-01-01 c high TRUE 96.5 107. 5 101 238 2220
#> 3 3 2008-01-01 d low FALSE 99.3 96.2 71 111 250 2376
#> 4 4 2008-01-01 c high TRUE 109. 102. 60 130 287 2220
#> 5 5 2008-01-01 a high FALSE 99.7 113. 96 196 284 2592
#> 6 6 2008-01-01 a high TRUE 102. 114. 19 163 206 2592
#> 7 7 2008-01-01 d low FALSE 91.0 87.9 77 133 201 2376
#> 8 8 2008-01-01 b low TRUE 109. 98.7 74 191 249 2460
#> 9 9 2008-01-01 e low FALSE 99.8 89.8 92 106 277 2352
#> 10 10 2008-01-01 c low TRUE 122. 83.6 4 134 209 2220
#> # … with 11,990 more rows
# check based on one variable & return distinct/unique rows only
copies(pdata, g, filter = "unique")
#> # A tibble: 0 × 10
#> # … with 10 variables: id <int>, d <date>, g <fct>, high_low <chr>, even <lgl>,
#> # y1 <dbl>, y2 <dbl>, x1 <int>, x2 <int>, x3 <int>
# check based on one variable & return the 1st detected copy for cases where
# more than one copy is detected (like `dplyr::distinct()` or `unique()`)
copies(pdata, g, filter = "first")
#> # A tibble: 5 × 10
#> id d g high_low even y1 y2 x1 x2 x3
#> <int> <date> <fct> <chr> <lgl> <dbl> <dbl> <int> <int> <int>
#> 1 1 2008-01-01 e high FALSE 106. 118. 59 116 248
#> 2 2 2008-01-01 c high TRUE 96.5 107. 5 101 238
#> 3 3 2008-01-01 d low FALSE 99.3 96.2 71 111 250
#> 4 5 2008-01-01 a high FALSE 99.7 113. 96 196 284
#> 5 8 2008-01-01 b low TRUE 109. 98.7 74 191 249
# check based on one variable & return the last detected copy for cases where
# more than one copy is detected (like `unique()` with `fromLast = TRUE`)
copies(pdata, g, filter = "last")
#> # A tibble: 5 × 10
#> id d g high_low even y1 y2 x1 x2 x3
#> <int> <date> <fct> <chr> <lgl> <dbl> <dbl> <int> <int> <int>
#> 1 992 2019-01-01 d low TRUE 245. 96.7 44 185 212
#> 2 993 2019-01-01 b high FALSE 195. 114. 71 129 239
#> 3 997 2019-01-01 a high FALSE 171. 103. 12 172 221
#> 4 998 2019-01-01 e high TRUE 154. 116. 35 115 254
#> 5 1000 2019-01-01 c low TRUE 280. 84.3 50 118 278
if (FALSE) {
copies(pdata, high_low, g) #check based on 2 variables
copies(pdata) #check based on all columns
}