How nanoparquet maps R types to Parquet types.

R's data types

When writing out a data frame, nanoparquet maps R's data types to Parquet logical types. This is how the mapping is performed.

These rules will likely change until nanoparquet reaches version 1.0.0.

  1. Factors (i.e. vectors that inherit the factor class) are converted to character vectors using as.character(), then written as a STRSXP (character vector) type. The fact that a column is a factor is stored in the Arrow metadata (see below), unless the nanoparquet.write_arrow_metadata option is set to FALSE.

  2. Dates (i.e. the Date class) are written as the DATE logical type, which is the INT32 type internally.

  3. hms objects (from the hms package) are written as the TIME(true, MILLIS) logical type, which is the INT32 Parquet type internally. Sub-millisecond precision is lost.

  4. POSIXct objects are written as the TIMESTAMP(true, MICROS) logical type, which is the INT64 Parquet type internally. Sub-microsecond precision is lost.

  5. difftime objects (that are not hms objects, see above) are written as the INT64 Parquet type, with the time difference stored in nanoseconds; the Arrow metadata (see below) notes that the column has type Duration with NANOSECONDS unit.

  6. Integer vectors (INTSXP) are written as INT(32, true) logical type, which corresponds to the INT32 type.

  7. Real vectors (REALSXP) are written as the DOUBLE type.

  8. Character vectors (STRSXP) are written as the STRING logical type, which has the BYTE_ARRAY type. They are always converted to UTF-8 before writing.

  9. Logical vectors (LGLSXP) are written as the BOOLEAN type.

  10. Writing any other vector type is currently an error.

You can use parquet_column_types() on a data frame to map R data types to Parquet data types.
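
For example, a minimal sketch of the write-side mapping (the data frame and its column names are illustrative):

    library(nanoparquet)

    df <- data.frame(
      fct = factor(c("a", "b", "a")),   # rule 1: factor -> STRING, noted in Arrow metadata
      dte = Sys.Date() + 0:2,           # rule 2: Date -> DATE (INT32)
      int = 1:3,                        # rule 6: integer -> INT(32, true)
      dbl = c(1.5, 2.5, 3.5),           # rule 7: double -> DOUBLE
      chr = c("x", "y", "z"),           # rule 8: character -> STRING (BYTE_ARRAY)
      lgl = c(TRUE, FALSE, NA)          # rule 9: logical -> BOOLEAN
    )

    # One row per column; the type-related columns show the Parquet
    # types nanoparquet would write, and r_type shows the R type.
    parquet_column_types(df)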

Parquet's data types

When reading a Parquet file, nanoparquet relies on the logical types and the Arrow metadata (if present, see below), in addition to the low-level data types. The exact rules are below.

These rules will likely change until nanoparquet reaches version 1.0.0.

  1. The BOOLEAN type is read as a logical vector (LGLSXP).

  2. The STRING logical type and the UTF8 converted type are read as a character vector with UTF-8 encoding.

  3. The DATE logical type and the DATE converted type are read as a Date R object.

  4. The TIME logical type and the TIME_MILLIS and TIME_MICROS converted types are read as hms objects (see the hms package).

  5. The TIMESTAMP logical type and the TIMESTAMP_MILLIS and TIMESTAMP_MICROS converted types are read as POSIXct objects. If the logical type has the UTC flag set, then the time zone of the POSIXct object is set to UTC.

  6. INT32 is read as an integer vector (INTSXP).

  7. INT64, DOUBLE and FLOAT are read as real vectors (REALSXP).

  8. INT96 is read as a POSIXct vector with the tzone attribute set to "UTC". Storing timestamps as INT96 is an old convention.

  9. The DECIMAL converted type (FIXED_LEN_BYTE_ARRAY or BYTE_ARRAY type) is read as a real vector (REALSXP), potentially losing precision.

  10. The ENUM logical type is read as a character vector.

  11. The UUID logical type is read as a character vector that uses the 00112233-4455-6677-8899-aabbccddeeff form.

  12. BYTE_ARRAY is read as a factor object if the file was written by Arrow and the original data type of the column was a factor. (See 'The Arrow metadata' below.)

  13. Otherwise BYTE_ARRAY is read as a list of raw vectors, with missing values denoted by NULL.

Other logical and converted types are read as their annotated low level types:

  1. INT(8, true), INT(16, true) and INT(32, true) are read as integer vectors because they are INT32 internally in Parquet.

  2. INT(64, true) is read as a real vector (REALSXP).

  3. Unsigned integer types INT(8, false), INT(16, false) and INT(32, false) are read as integer vectors (INTSXP). Large positive values may overflow into negative values; this is a known issue that we will fix.

  4. INT(64, false) is read as a real vector (REALSXP). Large positive values may overflow into negative values; this is a known issue that we will fix.

  5. FLOAT16 is a fixed length byte array, and nanoparquet reads it as a list of raw vectors. Missing values are denoted by NULL.

  6. INTERVAL is a fixed length byte array, and nanoparquet reads it as a list of raw vectors. Missing values are denoted by NULL.

  7. JSON and BSON are read as character vectors (STRSXP).

These types are not yet supported:

  1. Nested types (LIST, MAP) are not supported.

  2. The UNKNOWN logical type is not supported.

You can use the parquet_column_types() function to see how R would read the columns of a Parquet file. Look at the r_type column.
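
A minimal sketch, assuming an existing file (the name "data.parquet" is a placeholder):

    # Look at the r_type column to see the R type of each Parquet column.
    sch <- parquet_column_types("data.parquet")
    sch$r_type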

The Arrow metadata

Apache Arrow (i.e. the arrow R package) adds additional metadata to Parquet files when writing them in arrow::write_parquet(). Then, when reading the file in arrow::read_parquet(), it uses this metadata to recreate the same Arrow and R data types as before writing.

nanoparquet::write_parquet() also adds the Arrow metadata to Parquet files, unless the nanoparquet.write_arrow_metadata option is set to FALSE.

Similarly, nanoparquet::read_parquet() uses the Arrow metadata in the Parquet file (if present), unless the nanoparquet.use_arrow_metadata option is set to FALSE.
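
Both options are regular R options; a sketch of toggling them (df and the file name are placeholders):

    # Write without the Arrow metadata:
    options(nanoparquet.write_arrow_metadata = FALSE)
    nanoparquet::write_parquet(df, "data.parquet")

    # Read, ignoring any Arrow metadata present in the file:
    options(nanoparquet.use_arrow_metadata = FALSE)
    df2 <- nanoparquet::read_parquet("data.parquet")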

The Arrow metadata is stored in the file level key-value metadata, with key ARROW:schema.
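
You can look for this key with parquet_metadata(); a sketch, assuming its file_meta_data component exposes the key-value pairs as a data frame (the file name is again a placeholder):

    md <- nanoparquet::parquet_metadata("data.parquet")
    kv <- md$file_meta_data$key_value_metadata[[1]]
    # Arrow stores the serialized schema base64-encoded under this key:
    kv$value[kv$key == "ARROW:schema"]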

Currently nanoparquet uses the Arrow metadata for two things:

  • It uses it to detect factors. Without the Arrow metadata, factors are read as character vectors. (See the round-trip sketch after this list.)

  • It uses it to detect difftime objects. Without the Arrow metadata these are read as INT64 columns, containing the time difference in nanoseconds.
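
A round-trip sketch for factors:

    df <- data.frame(f = factor(c("lo", "hi", "lo")))
    tmp <- tempfile(fileext = ".parquet")
    nanoparquet::write_parquet(df, tmp)

    # "factor" with the Arrow metadata; "character" if
    # nanoparquet.use_arrow_metadata is set to FALSE.
    class(nanoparquet::read_parquet(tmp)$f)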

See also

nanoparquet-package for options that modify the type mappings.