
Commit bd33489

kazuk and tustvold authored
add parquet-fromcsv (#1) (#1798)
* add parquet-fromcsv (#1): add command line tool for convert csv to parquet.
* add `text` for non-rust documentation text
* Update parquet/src/bin/parquet-fromcsv.rs
  Co-authored-by: Raphael Taylor-Davies <[email protected]>
* Update parquet/src/bin/parquet-fromcsv.rs
  Co-authored-by: Raphael Taylor-Davies <[email protected]>
* Update parquet/src/bin/parquet-fromcsv.rs
  Co-authored-by: Raphael Taylor-Davies <[email protected]>
* Update parquet/src/bin/parquet-fromcsv.rs
  Co-authored-by: Raphael Taylor-Davies <[email protected]>
* automate update help text
* remove anyhow
* add rat_exclude_files
* update test_command_help
* fix clippy warnings
* add writer-version, max-row-group-size arg
* fix cargo fmt lint

Co-authored-by: Raphael Taylor-Davies <[email protected]>
1 parent 23acd55 commit bd33489

File tree

4 files changed: +706 −1 lines changed

dev/release/rat_exclude_files.txt

Lines changed: 1 addition & 0 deletions

@@ -20,3 +20,4 @@ conbench/.isort.cfg
 arrow-flight/src/arrow.flight.protocol.rs
 arrow-flight/src/sql/arrow.flight.protocol.sql.rs
 .github/*
+parquet/src/bin/parquet-fromcsv-help.txt

parquet/Cargo.toml

Lines changed: 5 additions & 1 deletion

@@ -71,7 +71,7 @@ default = ["arrow", "snap", "brotli", "flate2", "lz4", "zstd", "base64"]
 # Enable arrow reader/writer APIs
 arrow = ["dep:arrow", "base64"]
 # Enable CLI tools
-cli = ["serde_json", "base64", "clap"]
+cli = ["serde_json", "base64", "clap","arrow/csv"]
 # Enable internal testing APIs
 test_common = []
 # Experimental, unstable functionality primarily used for testing
@@ -91,6 +91,10 @@ required-features = ["cli"]
 name = "parquet-rowcount"
 required-features = ["cli"]

+[[bin]]
+name = "parquet-fromcsv"
+required-features = ["cli"]
+
 [[bench]]
 name = "arrow_writer"
 required-features = ["arrow"]
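
The Cargo.toml change registers the new binary behind the existing cli feature, so it only builds when that feature is enabled. As a rough sketch using standard Cargo flags (run from an arrow-rs checkout):

    # build the new tool with the cli feature enabled
    cargo build -p parquet --features cli --bin parquet-fromcsv

    # or run it directly and print its generated help text
    cargo run -p parquet --features cli --bin parquet-fromcsv -- --help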
parquet/src/bin/parquet-fromcsv-help.txt

Lines changed: 67 additions & 0 deletions

@@ -0,0 +1,67 @@
+parquet 15.0.0
+Apache Arrow <[email protected]>
+Binary to convert csv to Parquet
+
+USAGE:
+    parquet [OPTIONS] --schema <SCHEMA> --input-file <INPUT_FILE> --output-file <OUTPUT_FILE>
+
+OPTIONS:
+    -b, --batch-size <BATCH_SIZE>
+            batch size
+
+            [env: PARQUET_FROM_CSV_BATCHSIZE=]
+            [default: 1000]
+
+    -c, --parquet-compression <PARQUET_COMPRESSION>
+            compression mode
+
+            [default: SNAPPY]
+
+    -d, --delimiter <DELIMITER>
+            field delimiter
+
+            default value: when input_format==CSV: ',' when input_format==TSV: 'TAB'
+
+    -D, --double-quote <DOUBLE_QUOTE>
+            double quote
+
+    -e, --escape-char <ESCAPE_CHAR>
+            escape character
+
+    -f, --input-format <INPUT_FORMAT>
+            input file format
+
+            [default: csv]
+            [possible values: csv, tsv]
+
+    -h, --has-header
+            has header
+
+        --help
+            Print help information
+
+    -i, --input-file <INPUT_FILE>
+            input CSV file
+
+    -m, --max-row-group-size <MAX_ROW_GROUP_SIZE>
+            max row group size
+
+    -o, --output-file <OUTPUT_FILE>
+            output Parquet file
+
+    -q, --quote-char <QUOTE_CHAR>
+            quote character
+
+    -r, --record-terminator <RECORD_TERMINATOR>
+            record terminator
+
+            [possible values: lf, crlf, cr]
+
+    -s, --schema <SCHEMA>
+            message schema for output Parquet
+
+    -V, --version
+            Print version information
+
+    -w, --writer-version <WRITER_VERSION>
+            writer version

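For context, a minimal sketch of how the tool might be invoked, using only the options documented in the help text above. The file names and schema contents below are hypothetical; the schema uses the standard Parquet message-type syntax. A hypothetical schema file, message.schema:

    message example {
        REQUIRED INT64 id;
        REQUIRED BINARY name (UTF8);
    }

and a hypothetical invocation converting data.csv (which has a header row) into data.parquet:

    parquet-fromcsv --schema message.schema --has-header --input-file data.csv --output-file data.parquet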