
This function should work on all files, even those that read_parquet() cannot read because of an unsupported schema, encoding, compression, or some other reason.

Usage

parquet_metadata(file)

Arguments

file

Path to a Parquet file.

Value

A named list with entries:

  • $file_meta_data: a data frame with file metadata:

    • file_name: file name.

    • version: Parquet version, an integer.

    • num_rows: total number of rows.

    • key_value_metadata: list column of data frames, each with two character columns called key and value. This is the key-value metadata of the file. Arrow stores its schema here.

    • created_by: A string scalar, usually the name of the software that created the file.

  • $schema: a data frame, the schema of the file. It has one row for each node (inner node or leaf node). For flat files this means one root node (an inner node), which is always the first row, followed by one row for each "real" column. For nested schemas, the rows are in depth-first search order. The most important columns are:

    • file_name: file name.

    • name: column name.

    • type: data type. One of the low-level Parquet data types.

    • type_length: length for fixed length byte arrays.

    • repetition_type: character, one of REQUIRED, OPTIONAL or REPEATED.

    • logical_type: a list column, the logical types of the columns. An element has at least an entry called type, and potentially additional entries, e.g. bit_width, is_signed, etc.

    • num_children: number of child nodes. Should be a non-negative integer for the root node, and NA for a leaf node.

  • $row_groups: a data frame, information about the row groups.

  • $column_chunks: a data frame, information about all column chunks, across all row groups. Some important columns:

    • file_name: file name.

    • row_group: which row group this chunk belongs to.

    • column: which leaf column this chunk belongs to. The order is the same as in $schema, but only leaf columns (i.e. columns with NA children) are counted.

    • file_path: which file the chunk is stored in. NA means the same file.

    • file_offset: where the column chunk begins in the file.

    • type: low-level Parquet data type.

    • encodings: encodings used to store this chunk. It is a list column of character vectors of encoding names. Current possible encodings: "PLAIN", "GROUP_VAR_INT", "PLAIN_DICTIONARY", "RLE", "BIT_PACKED", "DELTA_BINARY_PACKED", "DELTA_LENGTH_BYTE_ARRAY", "DELTA_BYTE_ARRAY", "RLE_DICTIONARY", "BYTE_STREAM_SPLIT".

    • path_in_schema: list column of character vectors. Each element is the path to the column from the root node. For flat schemas it is simply the column name.

    • codec: compression codec used for the column chunk. Possible values are: "UNCOMPRESSED", "SNAPPY", "GZIP", "LZO", "BROTLI", "LZ4", "ZSTD".

    • num_values: number of values in this column chunk.

    • total_uncompressed_size: total uncompressed size in bytes.

    • total_compressed_size: total compressed size in bytes.

    • data_page_offset: absolute position of the first data page of the column chunk in the file.

    • index_page_offset: absolute position of the first index page of the column chunk in the file, or NA if there are no index pages.

    • dictionary_page_offset: absolute position of the first dictionary page of the column chunk in the file, or NA if there are no dictionary pages.
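
As a rough illustration of how the entries of the returned list relate to each other, the sketch below selects the leaf rows of $schema and matches them to $column_chunks by position, then computes a compression ratio per chunk. It assumes the userdata1.parquet file bundled with the package, and it assumes that the column index is zero-based, as the example output on this page suggests. The variable names (md, leaves, chunks, ratio) are illustrative only.

```r
library(nanoparquet)

file_name <- system.file("extdata/userdata1.parquet", package = "nanoparquet")
md <- parquet_metadata(file_name)

# Leaf nodes are the schema rows whose num_children is NA.
leaves <- md$schema[is.na(md$schema$num_children), ]

# `column` counts leaf columns only; assumed zero-based here, hence the + 1.
chunks <- md$column_chunks
chunks$name <- leaves$name[chunks$column + 1]

# Compressed-to-uncompressed size ratio for each column chunk.
chunks$ratio <- chunks$total_compressed_size / chunks$total_uncompressed_size
chunks[, c("row_group", "name", "codec", "ratio")]
```

For nested schemas the last element of path_in_schema may be a more direct way to label a chunk than the positional match used here.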

See also

parquet_info() for a much shorter summary. parquet_column_types() and parquet_schema() for column information. read_parquet() to read and write_parquet() to write Parquet files. nanoparquet-types for the R <-> Parquet type mappings.

Examples

file_name <- system.file("extdata/userdata1.parquet", package = "nanoparquet")
nanoparquet::parquet_metadata(file_name)
#> $file_meta_data
#> # A data frame: 1 × 5
#>   file_name                 version num_rows key_value_metadata created_by
#>   <chr>                       <int>    <dbl> <I<list>>          <chr>     
#> 1 /home/runner/work/_temp/…       1     1000 <tbl [1 × 2]>      https://g…
#> 
#> $schema
#> # A data frame: 14 × 11
#>    file_name        name  type  type_length repetition_type converted_type
#>    <chr>            <chr> <chr>       <int> <chr>           <chr>         
#>  1 /home/runner/wo… sche… NA             NA NA              NA            
#>  2 /home/runner/wo… regi… INT64          NA REQUIRED        TIMESTAMP_MIC…
#>  3 /home/runner/wo… id    INT32          NA REQUIRED        INT_32        
#>  4 /home/runner/wo… firs… BYTE…          NA OPTIONAL        UTF8          
#>  5 /home/runner/wo… last… BYTE…          NA REQUIRED        UTF8          
#>  6 /home/runner/wo… email BYTE…          NA OPTIONAL        UTF8          
#>  7 /home/runner/wo… gend… BYTE…          NA OPTIONAL        UTF8          
#>  8 /home/runner/wo… ip_a… BYTE…          NA REQUIRED        UTF8          
#>  9 /home/runner/wo… cc    BYTE…          NA OPTIONAL        UTF8          
#> 10 /home/runner/wo… coun… BYTE…          NA REQUIRED        UTF8          
#> 11 /home/runner/wo… birt… INT32          NA OPTIONAL        DATE          
#> 12 /home/runner/wo… sala… DOUB…          NA OPTIONAL        NA            
#> 13 /home/runner/wo… title BYTE…          NA OPTIONAL        UTF8          
#> 14 /home/runner/wo… comm… BYTE…          NA OPTIONAL        UTF8          
#> # ℹ 5 more variables: logical_type <I<list>>, num_children <int>,
#> #   scale <int>, precision <int>, field_id <int>
#> 
#> $row_groups
#> # A data frame: 1 × 7
#>   file_name                        id total_byte_size num_rows file_offset
#>   <chr>                         <int>           <dbl>    <dbl>       <dbl>
#> 1 /home/runner/work/_temp/Libr…     0           71427     1000          NA
#> # ℹ 2 more variables: total_compressed_size <dbl>, ordinal <int>
#> 
#> $column_chunks
#> # A data frame: 13 × 19
#>    file_name    row_group column file_path file_offset offset_index_offset
#>    <chr>            <int>  <int> <chr>           <dbl>               <dbl>
#>  1 /home/runne…         0      0 NA                  4                  NA
#>  2 /home/runne…         0      1 NA               6741                  NA
#>  3 /home/runne…         0      2 NA              12259                  NA
#>  4 /home/runne…         0      3 NA              15211                  NA
#>  5 /home/runne…         0      4 NA              16239                  NA
#>  6 /home/runne…         0      5 NA              31759                  NA
#>  7 /home/runne…         0      6 NA              32031                  NA
#>  8 /home/runne…         0      7 NA              42952                  NA
#>  9 /home/runne…         0      8 NA              55009                  NA
#> 10 /home/runne…         0      9 NA              55925                  NA
#> 11 /home/runne…         0     10 NA              59312                  NA
#> 12 /home/runne…         0     11 NA              67026                  NA
#> 13 /home/runne…         0     12 NA              71089                  NA
#> # ℹ 13 more variables: offset_index_length <int>,
#> #   column_index_offset <dbl>, column_index_length <int>, type <chr>,
#> #   encodings <I<list>>, path_in_schema <I<list>>, codec <chr>,
#> #   num_values <dbl>, total_uncompressed_size <dbl>,
#> #   total_compressed_size <dbl>, data_page_offset <dbl>,
#> #   index_page_offset <dbl>, dictionary_page_offset <dbl>
#>
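
The printed output above truncates the list columns, so a short follow-up may help. This sketch, again assuming the bundled userdata1.parquet file, pulls out the key-value metadata and tabulates the codecs and encodings recorded in $column_chunks. The names md and kv are illustrative, and no output is shown because it depends on the file.

```r
library(nanoparquet)

file_name <- system.file("extdata/userdata1.parquet", package = "nanoparquet")
md <- parquet_metadata(file_name)

# key_value_metadata is a list column; its single element for this file
# is a data frame with `key` and `value` columns.
kv <- md$file_meta_data$key_value_metadata[[1]]
kv$key

# Tabulate the compression codecs used across all column chunks.
table(md$column_chunks$codec)

# encodings is a list column of character vectors; flatten it to see
# the distinct encodings used anywhere in the file.
sort(unique(unlist(md$column_chunks$encodings)))
```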