nanoparquet: Read and Write 'Parquet' Files

Self-sufficient reader and writer for flat 'Parquet' files. Can read most 'Parquet' data types. Can write many 'R' data types, including factors and temporal types. See docs for limitations.

Details

nanoparquet is a reader and writer for a common subset of Parquet files.

Features:

Read and write flat (i.e. non-nested) Parquet files.
Can read most Parquet data types.
Can write many R data types, including factors and temporal types, to Parquet.
Completely dependency free.
Supports Snappy, Gzip and Zstd compression.
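
As a rough illustration of the compression support listed above, the sketch below writes the same data frame once per codec and compares the resulting file sizes. It assumes the codec is chosen through the `compression` argument of `write_parquet()`. The data frame, the `codecs` and `paths` variables, and the file names are made up for the example.

```r
# Illustrative data; any data frame with supported column types would do.
df <- data.frame(
  id = 1:1000,
  group = factor(rep(c("a", "b"), 500)),
  value = rep(c(1.5, 2.5, 3.5, 4.5), 250)
)

# One output file per codec, written to the session's temporary directory.
codecs <- c("snappy", "gzip", "zstd")
paths <- file.path(tempdir(), paste0("example-", codecs, ".parquet"))

for (i in seq_along(codecs)) {
  # Assumption: the codec is selected with the `compression` argument.
  nanoparquet::write_parquet(df, paths[i], compression = codecs[i])
}

# File size in bytes for each codec.
sizes <- setNames(file.size(paths), codecs)
print(sizes)
```

The relative sizes depend heavily on the data, so treat the numbers as indicative only.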

Limitations:

Nested Parquet types are not supported.
Some Parquet logical types are not supported: FLOAT16, INTERVAL, UNKNOWN.
Only Snappy, Gzip and Zstd compression is supported.
Encryption is not supported.
Reading files from URLs is not supported.
nanoparquet is single-threaded and not fully optimized, so it is probably not well suited to large data sets. It should be fine for files of up to a couple of gigabytes. Reading or writing a ~250MB file that has 32 million rows and 14 columns takes about 10-15 seconds on an M2 MacBook Pro. For larger files, use Apache Arrow or DuckDB.

Installation

Install the R package from CRAN:

install.packages("nanoparquet")

Usage

Read

Call read_parquet() to read a Parquet file:

df <- nanoparquet::read_parquet("example.parquet")

To see the columns of a Parquet file and how their types are mapped to R types by read_parquet(), call parquet_column_types() first:

nanoparquet::parquet_column_types("example.parquet")

Folders of similar-structured Parquet files (e.g. produced by Spark) can be read like this:

df <- data.table::rbindlist(lapply(
  Sys.glob("some-folder/part-*.parquet"),
  nanoparquet::read_parquet
))
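
If data.table is not available, the same pattern should also work with base R's `rbind()`, at the cost of speed on many or large files. The sketch below is self-contained: it first creates a hypothetical `some-folder` directory in the temporary directory and fills it with three part files split from `mtcars`, then reads them back and combines them.

```r
# Create a hypothetical folder with three part files, for demonstration only.
folder <- file.path(tempdir(), "some-folder")
dir.create(folder, showWarnings = FALSE)
parts <- split(mtcars, rep(1:3, length.out = nrow(mtcars)))
for (i in seq_along(parts)) {
  nanoparquet::write_parquet(
    parts[[i]],
    file.path(folder, sprintf("part-%03d.parquet", i))
  )
}

# Read every part file and stack the resulting data frames with base R.
files <- Sys.glob(file.path(folder, "part-*.parquet"))
df <- do.call(rbind, lapply(files, nanoparquet::read_parquet))
nrow(df)
```

This assumes all part files share the same column names and compatible column types. `rbind()` stops with an error if the column names differ.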

Write

Call write_parquet() to write a data frame to a Parquet file:

nanoparquet::write_parquet(mtcars, "mtcars.parquet")

To see how the columns of the data frame will be mapped to Parquet types by write_parquet(), call parquet_column_types() first:

nanoparquet::parquet_column_types(mtcars)
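
To check how a particular data frame survives a write, a simple round trip can help. The sketch below writes a small, made-up data frame with an integer, a factor and a Date column to a temporary file and reads it back. The column names and values are illustrative only.

```r
# A small data frame with an integer, a factor and a Date column.
df <- data.frame(
  n = 1:3,
  label = factor(c("low", "high", "low")),
  day = as.Date(c("2024-01-01", "2024-01-02", "2024-01-03"))
)

# Write to a temporary file, then read the file back.
path <- tempfile(fileext = ".parquet")
nanoparquet::write_parquet(df, path)
back <- nanoparquet::read_parquet(path)

# Compare the structure of the result with the original.
str(back)
str(df)
```

Comparing the two `str()` outputs shows which R types are restored as-is and which are converted on the way through Parquet.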

Inspect

Call parquet_info(), parquet_column_types(), parquet_schema() or parquet_metadata() to see various kinds of metadata from a Parquet file:

parquet_info() shows a basic summary of the file.
parquet_column_types() shows the leaf columns. These are the ones that read_parquet() reads into R.
parquet_schema() shows all columns, including non-leaf columns.
parquet_metadata() shows the most complete metadata: the file metadata, the schema, the row groups and the column chunks of the file.

nanoparquet::parquet_info("mtcars.parquet")
nanoparquet::parquet_column_types("mtcars.parquet")
nanoparquet::parquet_schema("mtcars.parquet")
nanoparquet::parquet_metadata("mtcars.parquet")

If you find a file that should be supported but isn't, please open an issue in the package's issue tracker with a link to the file.

Options

License

MIT

Author

Maintainer: Gábor Csárdi csardi.gabor@gmail.com

Authors:

Hannes Mühleisen (ORCID) [copyright holder]

Other contributors:

Google Inc. [copyright holder]
Apache Software Foundation [copyright holder]
Posit Software, PBC [copyright holder]
RAD Game Tools [copyright holder]
Valve Software [copyright holder]
Tenacious Software LLC [copyright holder]
Facebook, Inc. [copyright holder]