Checks a data frame for copied/duplicated rows based on specified variables to use for checking (via ...) or all columns (if unspecified). Also allows filtering of the output to retain all records with copy # info, a subset of distinct records, or a subset of duplicated records. This flexibility makes copies similar to both get_dupes) & distinct), while at the same time providing greater flexibility through a larger array of output options and competitive performance by using data.table as a backend. dupes is also available as a convenience shortcut for copies(filter = "dupes", sort_by_copies = TRUE).

copies(
  data,
  ...,
  filter = c("all", "dupes", "first", "last", "unique"),
  keep_all_cols = TRUE,
  sort_by_copies = FALSE,
  order = c("d", "a", "i"),
  na_last = FALSE,
  output = c("same", "tibble", "dt", "data.frame")
)

Arguments

data

a data frame, tibble, or data.table.

...

This special argument accepts any number of unquoted column names (also present in the data source) to use when searching for duplicates, e.g. x, y, z. Also accepts a character vector of column names or index numbers, e.g. c("x", "y", "z") or c(1, 2, 3), but not a mixture of formats in the same call. If no column names are specified, all columns will be used.

filter

Shortcuts for filtering (retaining a subset of) the rows of the output based on the number of copies detected. Options include: "all" = all rows that were present in the input (default), "dupes" = only rows that were found to be duplicated (mimics the behaviour of get_dupes), "unique" = only rows that appear as a single copy (not duplicated at all), "first" = keeps the 1st copy in cases where duplicates are detected (mimics the behaviour of distinct & unique), and "last" = keeps the last copy in cases where duplicates are detected. Note: if "dupes" is selected a message will be printed to the console indicating whether or not duplicates were detected.

keep_all_cols

If column names are specified using ..., this allows you to drop unspecified columns, similarly to the .keep_all argument for `dplyr::distinct()``

sort_by_copies

Only applicable to the "all" & "dupes" filtering options. If TRUE, sorts the results by the number of copies, in order specified by the order argument. Default is FALSE to maximize performance.

order

Only applicable to the "all" & "dupes" filtering options. If sort_by_copies is set to TRUE, this controls whether the results should be sorted in order of descending/decreasing = "d" (the default) or ascending/increasing = "a" or "i" copy numbers.

na_last

should rows of the specified columns with missing values be listed below non-missing values (TRUE/FALSE)? Default is FALSE.

output

"tibble" for tibble, "dt" for data.table, or "data.frame" for a data frame. "same", the default option, returns the same format as the input data.

Value

If filter argument is set to "all", returns a modified version of the input data frame with two additional columns added to the end/right side:

- `copy_number` = the row copy number which is included to allow
subsequent filtering based on the 1st or last copy detected.

- `n_copies` = the total number of copies detected

If filter is set to dupes, then only the n_copies column is appended and only duplicated rows are returned. If any other of the other filter

argument options are chosen, only the chosen subset of the rows & columns will be returned.

Author

Craig P. Hutton, craig.hutton@gov.bc.ca

Examples


# check based on one variable & return all rows with copy indicators
copies(pdata, g, filter = "all") #the default
#> # A tibble: 12,000 × 12
#>       id d          g     high_low even     y1    y2    x1    x2    x3 copy_nu…¹
#>    <int> <date>     <fct> <chr>    <lgl> <dbl> <dbl> <int> <int> <int>     <int>
#>  1     1 2008-01-01 e     high     FALSE 106.  118.     59   116   248         1
#>  2     2 2008-01-01 c     high     TRUE   96.5 107.      5   101   238         1
#>  3     3 2008-01-01 d     low      FALSE  99.3  96.2    71   111   250         1
#>  4     4 2008-01-01 c     high     TRUE  109.  102.     60   130   287         2
#>  5     5 2008-01-01 a     high     FALSE  99.7 113.     96   196   284         1
#>  6     6 2008-01-01 a     high     TRUE  102.  114.     19   163   206         2
#>  7     7 2008-01-01 d     low      FALSE  91.0  87.9    77   133   201         2
#>  8     8 2008-01-01 b     low      TRUE  109.   98.7    74   191   249         1
#>  9     9 2008-01-01 e     low      FALSE  99.8  89.8    92   106   277         2
#> 10    10 2008-01-01 c     low      TRUE  122.   83.6     4   134   209         3
#> # … with 11,990 more rows, 1 more variable: n_copies <int>, and abbreviated
#> #   variable name ¹​copy_number

# check based on one variable & return duplicated rows only
copies(pdata, g, filter = "dupes")
#> Duplicated rows detected! 12000 of 12000 rows in the input data have multiple copies.
#> # A tibble: 12,000 × 11
#>       id d          g     high_low even     y1    y2    x1    x2    x3 n_copies
#>    <int> <date>     <fct> <chr>    <lgl> <dbl> <dbl> <int> <int> <int>    <int>
#>  1     1 2008-01-01 e     high     FALSE 106.  118.     59   116   248     2352
#>  2     2 2008-01-01 c     high     TRUE   96.5 107.      5   101   238     2220
#>  3     3 2008-01-01 d     low      FALSE  99.3  96.2    71   111   250     2376
#>  4     4 2008-01-01 c     high     TRUE  109.  102.     60   130   287     2220
#>  5     5 2008-01-01 a     high     FALSE  99.7 113.     96   196   284     2592
#>  6     6 2008-01-01 a     high     TRUE  102.  114.     19   163   206     2592
#>  7     7 2008-01-01 d     low      FALSE  91.0  87.9    77   133   201     2376
#>  8     8 2008-01-01 b     low      TRUE  109.   98.7    74   191   249     2460
#>  9     9 2008-01-01 e     low      FALSE  99.8  89.8    92   106   277     2352
#> 10    10 2008-01-01 c     low      TRUE  122.   83.6     4   134   209     2220
#> # … with 11,990 more rows

# check based on one variable & return distinct/unique rows only
copies(pdata, g, filter = "unique")
#> # A tibble: 0 × 10
#> # … with 10 variables: id <int>, d <date>, g <fct>, high_low <chr>, even <lgl>,
#> #   y1 <dbl>, y2 <dbl>, x1 <int>, x2 <int>, x3 <int>

# check based on one variable & return the 1st detected copy for cases where
# more than one copy is detected (like `dplyr::distinct()` or `unique()`)
copies(pdata, g, filter = "first")
#> # A tibble: 5 × 10
#>      id d          g     high_low even     y1    y2    x1    x2    x3
#>   <int> <date>     <fct> <chr>    <lgl> <dbl> <dbl> <int> <int> <int>
#> 1     1 2008-01-01 e     high     FALSE 106.  118.     59   116   248
#> 2     2 2008-01-01 c     high     TRUE   96.5 107.      5   101   238
#> 3     3 2008-01-01 d     low      FALSE  99.3  96.2    71   111   250
#> 4     5 2008-01-01 a     high     FALSE  99.7 113.     96   196   284
#> 5     8 2008-01-01 b     low      TRUE  109.   98.7    74   191   249

# check based on one variable & return the last detected copy for cases where
# more than one copy is detected (like `unique()` with fromLast = TRUE`)
copies(pdata, g, filter = "last")
#> # A tibble: 5 × 10
#>      id d          g     high_low even     y1    y2    x1    x2    x3
#>   <int> <date>     <fct> <chr>    <lgl> <dbl> <dbl> <int> <int> <int>
#> 1   992 2019-01-01 d     low      TRUE   245.  96.7    44   185   212
#> 2   993 2019-01-01 b     high     FALSE  195. 114.     71   129   239
#> 3   997 2019-01-01 a     high     FALSE  171. 103.     12   172   221
#> 4   998 2019-01-01 e     high     TRUE   154. 116.     35   115   254
#> 5  1000 2019-01-01 c     low      TRUE   280.  84.3    50   118   278

if (FALSE) {
copies(pdata, high_low, g) #check based on 2 variables

copies(pdata) #check based on all columns
}