Convert .dat.gz files into arrow formats — dat_to_arrow

Converts .dat.gz files into formats provided by Apache Arrow. By default (overwrite = TRUE), these functions overwrite existing files of the same name. Convenience wrappers are provided for each Arrow format.

dat_to_arrow_formats(
  data_path,
  data_dict,
  output_dir,
  arrow_format,
  col_types = NULL,
  col_select = NULL,
  overwrite = TRUE,
  data_format = c("fwf", "csv", "tsv", "csv2"),
  tz = "UTC",
  date_format = "%AD",
  time_format = "%AT",
  ...
)

dat_to_parquet(...)

dat_to_arrow(...)

dat_to_feather(...)

Arguments

data_path

A path, or a vector of paths, to one or more .dat.gz files. If supplying a vector of paths, the files must share a common data dictionary.

data_dict

A data.frame with start, stop and name columns
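
As a minimal sketch of the expected shape, a dictionary for a hypothetical three-field record could be built as below. The field names and positions are illustrative, and the sketch assumes that start and stop are 1-based, inclusive character positions; the documentation above only names the columns.

```r
# Hypothetical layout; positions are ASSUMED to be 1-based and inclusive.
data_dict <- data.frame(
  start = c(1, 5, 13),
  stop  = c(4, 12, 20),
  name  = c("id", "visit_date", "amount"),
  stringsAsFactors = FALSE
)

# Sanity checks worth running before conversion:
# the width of each field ...
widths <- data_dict$stop - data_dict$start + 1

# ... and that each field ends before the next one starts.
n <- nrow(data_dict)
no_overlap <- all(data_dict$stop[-n] < data_dict$start[-1])
```

Checking widths and overlaps up front tends to catch off-by-one mistakes in a hand-written dictionary before they show up as shifted columns in the output.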

output_dir

Path to the directory where the output file(s) will be saved.

arrow_format

Must be one of "parquet", "arrow" or "feather".

col_types

One of NULL, a cols() specification, or a string. See vignette("readr") for more details.

If NULL, all column types will be imputed from guess_max rows on the input interspersed throughout the file. This is convenient (and fast), but not robust. If the imputation fails, you'll need to increase the guess_max or supply the correct types yourself.

Column specifications created by list() or cols() must contain one column specification for each column. If you only want to read a subset of the columns, use cols_only().

Alternatively, you can use a compact string representation where each character represents one column:

c = character
i = integer
n = number
d = double
l = logical
f = factor
D = date
T = date time
t = time
? = guess
_ or - = skip

By default, reading a file without a column specification will print a message showing the column types readr guessed. To remove this message, set show_col_types = FALSE or set options(readr.show_col_types = FALSE).
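
Because a compact string carries exactly one character per column, it can easily get out of step with the data dictionary. The sketch below pairs a compact string with hypothetical column names and checks that the lengths agree. The column names and helper variables are illustrative and not part of the package.

```r
# Hypothetical column names, in file order.
col_names <- c("id", "visit_date", "amount", "notes")

# c = character, D = date, d = double, _ = skip
col_types <- "cDd_"

# Split the string into one code per column.
codes <- strsplit(col_types, "")[[1]]

# Fail early if the string and the column list differ in length.
stopifnot(length(codes) == length(col_names))

# Label each code with the column it applies to.
spec <- codes
names(spec) <- col_names

# Columns that would be kept (everything not marked as skipped).
kept <- names(spec)[!spec %in% c("_", "-")]
```

Printing spec gives a quick side-by-side view of which type code lands on which column.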

col_select

A vector of column names

overwrite

logical; should existing destination files be overwritten?

data_format

The format of the input data. The default is "fwf"; other choices are "csv", "tsv" and "csv2".

tz

what timezone should datetime fields use? Default UTC. This is recommended to avoid timezone pain, but remember that the data is in UTC when doing analysis. See OlsonNames() for list of available timezones.
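
To illustrate what keeping data in UTC means during analysis, the base R sketch below shows one instant displayed in two zones. The timestamp and the "America/New_York" zone are arbitrary examples, and the sketch assumes the system's time zone database is available.

```r
# A timestamp stored in UTC.
x <- as.POSIXct("2020-01-01 12:00:00", tz = "UTC")

# The same instant formatted for another zone; only the display changes,
# the underlying value of x does not.
local_view <- format(x, tz = "America/New_York", usetz = TRUE)

# Check that a zone name is valid before passing it as tz.
tz_ok <- "America/New_York" %in% OlsonNames()
```

Comparing format(x, usetz = TRUE) with local_view makes the offset explicit, which helps when filtering by local calendar dates.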

date_format

date format for columns where date format is not specified in col_types

time_format

time format for columns where time format is not specified in col_types

...

Passed to one of arrow's writing functions, depending on the value of arrow_format.

Functions

dat_to_parquet: Convert to parquet
dat_to_arrow: Convert to arrow
dat_to_feather: Convert to feather
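
The following is a hedged end-to-end sketch rather than an official example. It writes a tiny gzipped fixed-width file, describes it with a hypothetical data dictionary, and calls dat_to_parquet() only if that function is available in the session. It assumes the wrapper forwards its arguments to dat_to_arrow_formats() with arrow_format already set, and that start and stop are 1-based, inclusive positions. The file contents and field names are made up for illustration.

```r
# Write a tiny gzipped fixed-width file to a temporary directory.
work_dir <- tempfile("dat_example_")
dir.create(work_dir)
data_path <- file.path(work_dir, "example.dat.gz")
con <- gzfile(data_path, "w")
writeLines(c("A001 12.50", "B002  7.25"), con)
close(con)

# Hypothetical dictionary describing the two fields in each record.
data_dict <- data.frame(
  start = c(1, 6),
  stop  = c(4, 10),
  name  = c("id", "amount"),
  stringsAsFactors = FALSE
)

# Guarded call: runs only where the package providing dat_to_parquet()
# is loaded. ASSUMPTION: the wrapper accepts the same arguments as
# dat_to_arrow_formats(), minus arrow_format.
if (exists("dat_to_parquet", mode = "function")) {
  dat_to_parquet(
    data_path  = data_path,
    data_dict  = data_dict,
    output_dir = work_dir,
    col_types  = "cd"
  )
}
```

Writing to a temporary directory keeps the default overwrite = TRUE behaviour from touching any existing output files while experimenting.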