Converts input data files into formats provided by Apache Arrow (parquet, arrow or feather). By default, these functions overwrite existing files of the same name. A convenience wrapper is provided for each Arrow format.
dat_to_arrow_formats(
  data_path,
  data_dict,
  output_dir,
  arrow_format,
  col_types = NULL,
  col_select = NULL,
  overwrite = TRUE,
  data_format = c("fwf", "csv", "tsv", "csv2"),
  tz = "UTC",
  date_format = "%AD",
  time_format = "%AT",
  ...
)
dat_to_parquet(...)
dat_to_arrow(...)
dat_to_feather(...)
A path or a vector of paths to a .dat.gz file. If supplying a vector of paths, they must share a common data dictionary.
A data.frame with start, stop and name columns
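As an illustration of what such a data dictionary describes, here is a minimal sketch using readr directly. It assumes that start and stop are 1-based, inclusive character positions (the text above does not say so explicitly), and the field names and sample records are made up for the example. It is not the package's own implementation.

```r
library(readr)

# Hypothetical data dictionary: one row per field.
# Assumption: start/stop are 1-based, inclusive character positions.
data_dict <- data.frame(
  start = c(1, 4, 9),
  stop  = c(3, 8, 10),
  name  = c("id", "score", "grp"),
  stringsAsFactors = FALSE
)

# Two made-up fixed-width records laid out according to the dictionary
tmp <- tempfile(fileext = ".dat")
writeLines(c("001 12.5AB", "002  7.0CD"), tmp)

# Translate the dictionary into readr's fixed-width column positions
positions <- fwf_positions(data_dict$start, data_dict$stop, data_dict$name)

# Read the file; "cdc" = character, double, character
dat <- read_fwf(tmp, col_positions = positions, col_types = "cdc")
```

Each row of the dictionary becomes one column of the result, named after the name entry.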
Path to the directory where the output file will be saved.
Must be one of "parquet", "arrow" or "feather".
One of NULL, a cols() specification, or a string. See vignette("readr") for more details.
If NULL, all column types will be imputed from guess_max rows of the input, interspersed throughout the file. This is convenient (and fast), but not robust. If the imputation fails, you'll need to increase guess_max or supply the correct types yourself.
Column specifications created by list() or cols() must contain one column specification for each column. If you only want to read a subset of the columns, use cols_only().
Alternatively, you can use a compact string representation where each character represents one column:
c = character
i = integer
n = number
d = double
l = logical
f = factor
D = date
T = date time
t = time
? = guess
_ or - = skip
By default, reading a file without a column specification will print a message showing the column types that readr guessed. To remove this message, set show_col_types = FALSE or set options(readr.show_col_types = FALSE).
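The column specification forms described above can be tried with readr on its own. The sketch below uses a small inline CSV with made-up column names. It shows the compact string form and an equivalent cols_only() specification, and it only demonstrates readr's behaviour, not how this package forwards col_types internally.

```r
library(readr)

# Made-up inline CSV; I() marks the string as literal data (readr >= 2.0)
csv_text <- I("id,when,score,notes\n1,2021-03-04,2.5,skip me\n2,2021-03-05,3.0,skip me too\n")

# Compact string: i = integer, D = date, d = double, _ = skip
dat <- read_csv(csv_text, col_types = "iDd_")

# cols_only() reads just the listed columns and drops the rest
dat_subset <- read_csv(
  csv_text,
  col_types = cols_only(id = col_integer(), score = col_double())
)
```

Supplying col_types explicitly also suppresses the column-type message, because nothing has to be guessed.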
A vector of column names
logical; should existing destination files be overwritten?
the format of the input data. Default is "fwf"; other choices are "csv", "csv2" and "tsv"
what timezone should datetime fields use? Default is "UTC". This is recommended to avoid timezone pain, but remember that the data is in UTC when doing analysis. See OlsonNames() for a list of available timezones.
date format for columns where date format is not specified in col_types
time format for columns where time format is not specified in col_types
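The defaults "%AD" and "%AT" match those of readr::locale(), so a plausible reading is that tz, date_format and time_format are forwarded to a readr locale. That forwarding is an assumption, not something stated here. The sketch below shows how such a locale changes parsing when no explicit format is given for a column. The format strings and input values are made up for illustration.

```r
library(readr)

# A locale with a non-default date and time format (illustrative values)
loc <- locale(tz = "UTC", date_format = "%d/%m/%Y", time_format = "%H.%M")

# With no explicit format, dates and times fall back to the locale's formats
d <- parse_date("31/12/2020", locale = loc)
tm <- parse_time("13.45", locale = loc)

# Datetimes are parsed as ISO 8601 and interpreted in the locale's timezone
dt <- parse_datetime("2020-12-31 13:45:00", locale = loc)
```

A column given an explicit format in col_types, for example col_date("%Y%m%d"), would not use the locale's default format.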
passed to one of arrow's writing functions, depending on the value of arrow_format
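To give a sense of what extra arguments might look like, the sketch below calls arrow::write_parquet() directly with its compression argument. Whether this package uses write_parquet() for the parquet format is an assumption, and the data frame is made up. Check the documentation of the arrow writer in question for the arguments it accepts.

```r
library(arrow)

# Made-up example data
df <- data.frame(id = 1:2, grp = c("a", "b"), stringsAsFactors = FALSE)

out <- tempfile(fileext = ".parquet")

# compression is a write_parquet() argument; it is the kind of option
# that could be supplied through ... (assumed forwarding)
write_parquet(df, out, compression = "uncompressed")

# Read the file back to confirm the round trip
back <- read_parquet(out)
```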
dat_to_parquet: Convert to parquet
dat_to_arrow: Convert to arrow
dat_to_feather: Convert to feather