pq¶
This feature requires the jaqy-avro plugin. It exports Apache Parquet files.
Unfortunately, because Parquet has a hard-coded dependency on the Hadoop libraries, this plugin is roughly 20 MB in size.
Options¶
Option | Description |
---|---|
-b,--blocksize <arg> | sets the row group / block size |
-c,--compression <arg> | sets the compression codec |
-d,--padding <arg> | sets the maximum padding size |
-p,--pagesize <arg> | sets the page size |
-r,--rowcount <arg> | sets the row count limit |
Note
- For --pagesize and --blocksize, it is possible to use mb and gb suffixes to specify the size. For instance, 1mb would be 1 * 1024 * 1024 bytes (see the example below).
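As a sketch of how these options might be combined on the .export line (the file and table names are hypothetical, and the sizes are illustrative rather than recommended values):

-- set a 128mb row group / block size and a 1mb page size
.export pq -b 128mb -p 1mb myfile.parquet
SELECT * FROM MyTable ORDER BY a;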
Supported Compression Codecs¶
Compression | Extension |
---|---|
brotli | .br |
gzip | .gz |
lz4 | .lz4 |
lzo | .lzo |
snappy | .snappy |
zstd | .zstd |
- It is possible to specify the compression codec implicitly by using the corresponding file extension in the file name.
- LZ4 compression requires a native Hadoop installation. This is one of the things hard-coded by the Apache Parquet library.
- LZO compression requires a separate library due to its GPL license. Please see https://github.com/twitter/hadoop-lzo for build instructions.
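For instance, the codec can be selected either explicitly with -c or implicitly via the file extension. The file and table names below are hypothetical, and it is assumed that -c accepts the codec names listed above:

-- explicit codec selection
.export pq -c gzip myfile.parquet
SELECT * FROM MyTable;

-- implicit codec selection via the .gz extension
.export pq myfile.parquet.gz
SELECT * FROM MyTable;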
Database Type to AVRO Type Mapping¶
Database Type | AVRO Type |
---|---|
BOOLEAN | BOOLEAN |
TINYINT, SMALLINT, INTEGER | INTEGER |
BIGINT | LONG |
FLOAT | FLOAT |
DOUBLE | DOUBLE |
ARRAY | ARRAY |
BINARY, VARBINARY, LONGVARBINARY, BLOB | BYTES |
DECIMAL, NUMERIC, REAL, CHAR, VARCHAR, CLOB | STRING |
Note
- DECIMAL is converted to string to preserve the precision.
- Array is in general treated as an array of string types. The primary reason is that there is no way to get the array element type in JDBC.
    - For PostgreSQL, because such information can easily be guessed, it is supported for some well known types.
- Struct is exported as an array of string types.
    - For Teradata, PERIOD data types, which are transmitted as Struct types, are converted into formats that match their BTEQ output formats.
    - For PostgreSQL, the driver reports the Struct type even though the data is actually a string. Jaqy has a specific workaround for this inconsistency.
Types not listed in the above table are stored as STRING. The AVRO exporter relies on the toString() function of the object retrieved by the JDBC driver to obtain the output. There is no guarantee that such string representations can be used for import.
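As an illustrative sketch (with hypothetical table and column names), the comments below note the AVRO type each column would map to, per the table above:

-- each column annotated with the AVRO type it maps to
.export pq types.parquet
SELECT
    flag,      -- BOOLEAN   -> BOOLEAN
    small_num, -- SMALLINT  -> INTEGER
    big_num,   -- BIGINT    -> LONG
    amount,    -- DECIMAL   -> STRING (precision preserved as text)
    payload    -- VARBINARY -> BYTES
FROM SampleTypes;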
For the explanation of page size, row group / block size, etc., see Apache Parquet.
Example¶
-- use snappy compression implicitly
.export pq myfile.parquet.snappy
SELECT * FROM MyTable ORDER BY a;