Add a proposal process #513
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,24 @@ | ||
| --- | ||
| Author: Julien Le Dem | ||
| Created: 2025-Aug-7 | ||
| Name: add BASE64 compression | ||
| Issue: https://github.com/apache/parquet-format/issues/NNN | ||
| Status: ARCHIVED | ||
| Reason: Did not compress | ||
| --- | ||
|
|
||
| # Proposal | ||
|
|
||
| ## Description | ||
| Add Base64 to the compression algorithms. | ||
| This is not backward compatible, since it adds a new compression algorithm. | ||
|
|
||
| ## Spec | ||
|
|
||
| See [BASE64 spec]. | ||
|
|
||
| ## Evaluation | ||
|
|
||
| After trying it out in the Java implementation, file size doubled on average. | ||
| See prototype [here](github.com/julienledem/mypoc) | ||
|
|
||
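
The archived example above reports a result measured with the Java prototype; the sketch below does not reproduce that measurement. It only illustrates, on raw bytes in Python, why Base64 cannot act as compression: every 3 input bytes become 4 output characters. The helper name `base64_size_ratio` and the 3000-byte random payload are assumptions made for illustration and are not part of the proposal.

```python
import base64
import os


def base64_size_ratio(payload: bytes) -> float:
    """Return the Base64-encoded size divided by the original size."""
    return len(base64.b64encode(payload)) / len(payload)


# Hypothetical stand-in for a data page: 3000 random bytes.
raw = os.urandom(3000)
encoded = base64.b64encode(raw)

# Base64 maps each group of 3 bytes to 4 ASCII characters,
# so the output is always larger than the input.
print(f"raw={len(raw)} bytes, base64={len(encoded)} bytes")
print(f"size ratio: {base64_size_ratio(raw):.3f}")
```

A ratio above 1.0 on any input is the kind of finding an Evaluation section would record before a proposal is archived.
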
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,56 @@ | ||
| # Proposals | ||
|
|
||
| This proposal process is intended for significantly impactful additions to the Parquet spec. The goal is to facilitate those projects and help them get contributed to Parquet. | ||
|
||
| Examples are changes that are not backward compatible, such as adding a new encoding or a new compression algorithm (older implementations cannot read new files). | ||
|
||
| This gives better visibility to projects that require coordination across several implementations. | ||
| Bug fixes, code-only features, or minor changes to the spec that can be ignored by older implementations can simply be filed as a GitHub issue. | ||
|
|
||
| ## Proposal lifecycle | ||
|
|
||
| Discuss -> Draft/POC -> Implementation -> Approval | ||
|
|
||
| ### Discuss | ||
| Start a [DISCUSS] thread on the mailing list (dev@parquet.apache.org) with your idea. At this point, the community can discuss whether the impact of the proposal requires a document here or just a GitHub issue. | ||
| Once you have a better idea of the general consensus on the proposal, open a GitHub issue using the proposal template. | ||
|
||
| Attaching a Google Doc to collect feedback and collaborate with the community usually works well early on. | ||
|
|
||
| *Transition:* Once you feel you have received enough feedback, or you need to start the POC to better answer the questions you are getting, you can move to the next step. Anybody is free to start a POC at any time. We just recommend getting feedback before you spend a significant amount of your time. | ||
|
|
||
| ### Draft/POC | ||
| Once you feel the discussion has stabilized and you are ready to start a POC, open a PR to add a new Markdown file in the proposals folder and give more visibility to the work in progress. | ||
| The proposal document can evolve over the course of the POC, in particular to add more links to findings and performance evaluations. Collaboration is encouraged. More validation on the POC increases the chances of success. | ||
|
Contributor
I worry a little bit about the friction of contributors keeping the markdown up-to-date? Maybe keeping this in the google doc? But we can see how it works in practice.
Member
Author
I have changed the wording to remove this as a requirement. I have left it as an option if people want to have a perennial place to save documents. |
||
|
|
||
| Example: [https://github.com/apache/parquet-format/pull/221] | ||
|
|
||
| Make sure you consider the [requirements document](https://docs.google.com/document/d/1qGDnOyoNyPvcN4FCRhbZGAvp0SfewlWo-WVsai5IKUo/edit?tab=t.0#heading=h.v4emiipkghrx) to ensure the success of the POC. (Note: this doc would become a markdown page in the repo) | ||
|
|
||
| *Transition:* There is enough clarity on the spec for the new feature, and we have identified the reference implementations that must be completed before we can release. | ||
|
||
|
|
||
| ### Implementation | ||
| Once we have reached enough consensus on the formalized spec change and validated it through the POC, we should have a clear idea of whether we want to pursue the implementation across the ecosystem. | ||
|
||
| At this stage we should finalize a formal spec contribution to parquet-format. The implementation is considered finished only once it meets the contribution guidelines. | ||
| See [CONTRIBUTING guidelines](https://github.com/apache/parquet-format/blob/master/CONTRIBUTING.md#additionschanges-to-the-format). | ||
|
|
||
| *Transition:* A PMC vote will formalize that we have concluded the implementation and are ready to release. | ||
|
|
||
| ### Approval | ||
| Once the implementation phase is finished, we can include the contribution in the next release. Congrats! | ||
|
|
||
| ## Active Proposals | ||
|
Contributor
this is a nice way to list / find the active proposal |
||
|
|
||
| | ID | Description | Status | | ||
| |-----|--------------|---------| | ||
| | [github issue] | adding this new encoding | POC | | ||
| | [github issue] | add Variant type | Implementation | | ||
|
|
||
| ## Implemented | ||
| | ID | Description | Status | release it was added | | ||
| |-----|--------------|---------|-----------------------| | ||
| | [github issue] | encryption | Completed | x.y.z | | ||
|
||
|
|
||
| ## Archived | ||
|
|
||
| | ID | Description | Status | reason for archiving | | ||
| |-----|--------------|---------|-----------------------| | ||
| | [github issue] | [adding base64 compression](1_BASE64_ENCODING.md) | Archived | POC showed that compression ratio was not practical | | ||
|
|
||
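
The lifecycle in the Proposals README above (Discuss -> Draft/POC -> Implementation -> Approval) is a strictly ordered sequence of stages. A minimal sketch of that ordering follows, assuming a proposal only moves forward one stage at a time. The `Stage` enum and `next_stage` helper are hypothetical names for illustration and do not exist in parquet-format. Archiving is tracked in a separate table in the README, so it is deliberately left out of the sequence.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    """Stages named in the 'Proposal lifecycle' section, in order."""
    DISCUSS = "Discuss"
    DRAFT_POC = "Draft/POC"
    IMPLEMENTATION = "Implementation"
    APPROVAL = "Approval"


# Enum members iterate in definition order, which matches the README.
ORDER = list(Stage)


def next_stage(stage: Stage) -> Optional[Stage]:
    """Return the stage that follows `stage`, or None after Approval."""
    position = ORDER.index(stage)
    if position + 1 < len(ORDER):
        return ORDER[position + 1]
    return None


print(" -> ".join(s.value for s in ORDER))
```

Each step corresponds to one of the *Transition:* notes in the README. The last one, from Implementation to Approval, is the one the README says is formalized by a PMC vote.
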
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,25 @@ | ||
| # Proposal | ||
|
|
||
| --- | ||
| Author: ~your name~ | ||
| Created: ~date~ | ||
| Name: *short sentence describing the proposal* | ||
| Issue: https://github.com/apache/parquet-format/issues/NNN | ||
| Status: DRAFT|IMPLEMENTATION|COMPLETED | ||
| --- | ||
|
|
||
| ## Description | ||
|
Contributor
I recommend also adding a "## Rationale" section: Describe why this is a feature that will improve the parquet format and what alternatives currently exist for the usecase (e.g. must use a different format, or "must build additional infrastructure to avoid re-parsing footer on each query", or "must use a general purpose compression algorithm to achieve the same space, thus slowing down query performance")
Member
Author
added! |
||
| *Short description of the proposal. Is it a new encoding? Is it backwards compatible (old readers will just ignore it)? Is it additional metadata?* | ||
|
|
||
| ## Rationale | ||
| Describe why this feature will improve the Parquet format and what alternatives currently exist for the use case (e.g. "must use a different format", "must build additional infrastructure to avoid re-parsing the footer on each query", or "must use a general-purpose compression algorithm to achieve the same space, thus slowing down query performance"). | ||
|
||
|
|
||
| ## Spec | ||
|
|
||
| At the proposal stage you don't need a fully fleshed out spec yet. | ||
| Please add links to any relevant documentation, papers, etc. | ||
| At the implementation stage, all the details will need to be clarified. | ||
|
|
||
| ## Evaluation | ||
| What datasets is it tested on, and what are the success criteria? | ||
| Please add any link to the relevant codebase. | ||
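
The template's header block (the `Key: value` lines between the two `---` markers) is regular enough to read mechanically, for instance to build the status tables in the Proposals README. The sketch below is one hedged way to do that, assuming each field sits on its own line and the first `---` pair delimits the block. `parse_front_matter` is a hypothetical helper, not tooling that exists in the repository, and the sample text is copied from the archived Base64 example.

```python
def parse_front_matter(text: str) -> dict[str, str]:
    """Collect 'Key: value' pairs found between the first pair of '---' lines."""
    fields: dict[str, str] = {}
    inside = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "---":
            if inside:
                break  # closing marker: stop reading
            inside = True
            continue
        if inside and ":" in stripped:
            # Split on the first colon only, so URLs in values stay intact.
            key, _, value = stripped.partition(":")
            fields[key.strip()] = value.strip()
    return fields


EXAMPLE = """---
Author: Julien Le Dem
Created: 2025-Aug-7
Name: add BASE64 compression
Issue: https://github.com/apache/parquet-format/issues/NNN
Status: ARCHIVED
Reason: Did not compress
---

# Proposal
"""

meta = parse_front_matter(EXAMPLE)
print(meta["Name"], "->", meta["Status"])
```

Because the parser looks for the first `---` line rather than requiring it at the top of the file, it also accepts the template layout, where the `# Proposal` heading precedes the header block.
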