Check the number of copies/duplicated rows in a data frame.

Checks a data frame for copied/duplicated rows based on specified variables to use for checking (via ...) or all columns (if unspecified). Also allows filtering of the output to retain all records with copy # info, a subset of distinct records, or a subset of duplicated records. This flexibility makes copies similar to both get_dupes) & distinct), while at the same time providing greater flexibility through a larger array of output options and competitive performance by using data.table as a backend. dupes is also available as a convenience shortcut for copies(filter = "dupes", sort_by_copies = TRUE).

copies(
  data,
  ...,
  filter = c("all", "dupes", "first", "last", "unique"),
  keep_all_cols = TRUE,
  sort_by_copies = FALSE,
  order = c("d", "a", "i"),
  na_last = FALSE,
  output = c("same", "tibble", "dt", "data.frame")
)

Arguments

data: a data frame, tibble, or data.table.
...: This special argument accepts any number of unquoted column names (also present in the data source) to use when searching for duplicates, e.g. x, y, z. Also accepts a character vector of column names or index numbers, e.g. c("x", "y", "z") or c(1, 2, 3), but not a mixture of formats in the same call. If no column names are specified, all columns will be used.
filter: Shortcuts for filtering (retaining a subset of) the rows of the output based on the number of copies detected. Options include: "all" = all rows that were present in the input (default), "dupes" = only rows that were found to be duplicated (mimics the behaviour of get_dupes), "unique" = only rows that appear as a single copy (not duplicated at all), "first" = keeps the 1st copy in cases where duplicates are detected (mimics the behaviour of distinct & unique), and "last" = keeps the last copy in cases where duplicates are detected. Note: if "dupes" is selected a message will be printed to the console indicating whether or not duplicates were detected.
keep_all_cols: If column names are specified using ..., this allows you to drop unspecified columns, similarly to the .keep_all argument for `dplyr::distinct()``
sort_by_copies: Only applicable to the "all" & "dupes" filtering options. If TRUE, sorts the results by the number of copies, in order specified by the order argument. Default is FALSE to maximize performance.
order: Only applicable to the "all" & "dupes" filtering options. If sort_by_copies is set to TRUE, this controls whether the results should be sorted in order of descending/decreasing = "d" (the default) or ascending/increasing = "a" or "i" copy numbers.
na_last: should rows of the specified columns with missing values be listed below non-missing values (TRUE/FALSE)? Default is FALSE.
output: "tibble" for tibble, "dt" for data.table, or "data.frame" for a data frame. "same", the default option, returns the same format as the input data.

Value

If filter argument is set to "all", returns a modified version of the input data frame with two additional columns added to the end/right side:

- `copy_number` = the row copy number which is included to allow
subsequent filtering based on the 1st or last copy detected.

- `n_copies` = the total number of copies detected

If filter is set to dupes, then only the n_copies column is appended and only duplicated rows are returned. If any other of the other filter

argument options are chosen, only the chosen subset of the rows & columns will be returned.

Author

Craig P. Hutton, craig.hutton@gov.bc.ca

Examples


# check based on one variable & return all rows with copy indicators
copies(pdata, g, filter = "all") #the default
#> # A tibble: 12,000 × 12
#>       id d          g     high_low even     y1    y2    x1    x2    x3 copy_nu…¹
#>    <int> <date>     <fct> <chr>    <lgl> <dbl> <dbl> <int> <int> <int>     <int>
#>  1     1 2008-01-01 e     high     FALSE 106.  118.     59   116   248         1
#>  2     2 2008-01-01 c     high     TRUE   96.5 107.      5   101   238         1
#>  3     3 2008-01-01 d     low      FALSE  99.3  96.2    71   111   250         1
#>  4     4 2008-01-01 c     high     TRUE  109.  102.     60   130   287         2
#>  5     5 2008-01-01 a     high     FALSE  99.7 113.     96   196   284         1
#>  6     6 2008-01-01 a     high     TRUE  102.  114.     19   163   206         2
#>  7     7 2008-01-01 d     low      FALSE  91.0  87.9    77   133   201         2
#>  8     8 2008-01-01 b     low      TRUE  109.   98.7    74   191   249         1
#>  9     9 2008-01-01 e     low      FALSE  99.8  89.8    92   106   277         2
#> 10    10 2008-01-01 c     low      TRUE  122.   83.6     4   134   209         3
#> # … with 11,990 more rows, 1 more variable: n_copies <int>, and abbreviated
#> #   variable name ¹copy_number

# check based on one variable & return duplicated rows only
copies(pdata, g, filter = "dupes")
#> Duplicated rows detected! 12000 of 12000 rows in the input data have multiple copies.
#> # A tibble: 12,000 × 11
#>       id d          g     high_low even     y1    y2    x1    x2    x3 n_copies
#>    <int> <date>     <fct> <chr>    <lgl> <dbl> <dbl> <int> <int> <int>    <int>
#>  1     1 2008-01-01 e     high     FALSE 106.  118.     59   116   248     2352
#>  2     2 2008-01-01 c     high     TRUE   96.5 107.      5   101   238     2220
#>  3     3 2008-01-01 d     low      FALSE  99.3  96.2    71   111   250     2376
#>  4     4 2008-01-01 c     high     TRUE  109.  102.     60   130   287     2220
#>  5     5 2008-01-01 a     high     FALSE  99.7 113.     96   196   284     2592
#>  6     6 2008-01-01 a     high     TRUE  102.  114.     19   163   206     2592
#>  7     7 2008-01-01 d     low      FALSE  91.0  87.9    77   133   201     2376
#>  8     8 2008-01-01 b     low      TRUE  109.   98.7    74   191   249     2460
#>  9     9 2008-01-01 e     low      FALSE  99.8  89.8    92   106   277     2352
#> 10    10 2008-01-01 c     low      TRUE  122.   83.6     4   134   209     2220
#> # … with 11,990 more rows

# check based on one variable & return distinct/unique rows only
copies(pdata, g, filter = "unique")
#> # A tibble: 0 × 10
#> # … with 10 variables: id <int>, d <date>, g <fct>, high_low <chr>, even <lgl>,
#> #   y1 <dbl>, y2 <dbl>, x1 <int>, x2 <int>, x3 <int>

# check based on one variable & return the 1st detected copy for cases where
# more than one copy is detected (like `dplyr::distinct()` or `unique()`)
copies(pdata, g, filter = "first")
#> # A tibble: 5 × 10
#>      id d          g     high_low even     y1    y2    x1    x2    x3
#>   <int> <date>     <fct> <chr>    <lgl> <dbl> <dbl> <int> <int> <int>
#> 1     1 2008-01-01 e     high     FALSE 106.  118.     59   116   248
#> 2     2 2008-01-01 c     high     TRUE   96.5 107.      5   101   238
#> 3     3 2008-01-01 d     low      FALSE  99.3  96.2    71   111   250
#> 4     5 2008-01-01 a     high     FALSE  99.7 113.     96   196   284
#> 5     8 2008-01-01 b     low      TRUE  109.   98.7    74   191   249

# check based on one variable & return the last detected copy for cases where
# more than one copy is detected (like `unique()` with fromLast = TRUE`)
copies(pdata, g, filter = "last")
#> # A tibble: 5 × 10
#>      id d          g     high_low even     y1    y2    x1    x2    x3
#>   <int> <date>     <fct> <chr>    <lgl> <dbl> <dbl> <int> <int> <int>
#> 1   992 2019-01-01 d     low      TRUE   245.  96.7    44   185   212
#> 2   993 2019-01-01 b     high     FALSE  195. 114.     71   129   239
#> 3   997 2019-01-01 a     high     FALSE  171. 103.     12   172   221
#> 4   998 2019-01-01 e     high     TRUE   154. 116.     35   115   254
#> 5  1000 2019-01-01 c     low      TRUE   280.  84.3    50   118   278

if (FALSE) {
copies(pdata, high_low, g) #check based on 2 variables

copies(pdata) #check based on all columns
}

Check the number of copies/duplicated rows in a data frame.

Arguments

Value

See also

Author

Examples