ENH: Support MultiIndex columns in parquet (#34777)#36305
ENH: Support MultiIndex columns in parquet (#34777)#36305jreback merged 20 commits intopandas-dev:masterfrom
Conversation
1. Update check to handle MultiIndex columns for parquet format
1. add whatsnew entry
dsaxton
left a comment
There was a problem hiding this comment.
Can you also add a test?
1. Update check to handle MultiIndex columns for parquet format 2. Edit whatsnew entry. 3. Add test for writing MultiIndex columns with string column names
1. Include issue number as a comment on added test
pandas/io/parquet.py
Outdated
| if not all( | ||
| x.inferred_type in {"string", "empty"} for x in df.columns.levels | ||
| ): | ||
| raise ValueError("parquet must have string column names") |
There was a problem hiding this comment.
can say something about 'for all values in each level of the MultiIndex'
There was a problem hiding this comment.
Thanks @jreback for the suggestion on the exception statement - adding that into my next commit!
pandas/tests/io/test_parquet.py
Outdated
|
|
||
| def test_write_column_multiindex_string(self, pa): | ||
| # GH #34777 | ||
| # Not supported in fastparquet as of 0.1.3 or older pyarrow version |
There was a problem hiding this comment.
are we > than the min pyarrow version?
There was a problem hiding this comment.
Based on the min versions listed in pandas dependencies, the min pyarrow version is 0.15 while we are currently at 0.16 - at least for the dev environment that I'm working on.
pandas/tests/io/test_parquet.py
Outdated
| ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"], | ||
| ["one", "two", "one", "two", "one", "two", "one", "two"], | ||
| ] | ||
| df = pd.DataFrame(np.random.randn(8, 8), columns=arrays) |
There was a problem hiding this comment.
can you add names to the MultiIndex levels. do these round trip?
There was a problem hiding this comment.
After adding names to the MultiIndex levels, looks like they do round trip on pytest.
doc/source/whatsnew/v1.2.0.rst
Outdated
| - :meth:`to_csv` did not support zip compression for binary file object not having a filename (:issue: `35058`) | ||
| - :meth:`to_csv` and :meth:`read_csv` did not honor `compression` and `encoding` for path-like objects that are internally converted to file-like objects (:issue:`35677`, :issue:`26124`, and :issue:`32392`) | ||
| - :meth:`to_picke` and :meth:`read_pickle` did not support compression for file-objects (:issue:`26237`, :issue:`29054`, and :issue:`29570`) | ||
| - :meth:`to_parquet` did not support :class:`MultiIndex` for columns in parquet format (:issue:`34777`) |
There was a problem hiding this comment.
i would move this to other enhancements; say this is enabled with pyarrow=.....
There was a problem hiding this comment.
Moved the whatsnew entry to "other enhancements" - thanks!
1. Add tests for writing Indexes and MultiIndexes for columns 2. Edit message for check to handle MultiIndex columns for parquet 3. Edit whatsnew entry to move entry to other enhancements
1. Fix PEP8 issue for error message in check for MultiIndex columns
jreback
left a comment
There was a problem hiding this comment.
looks fine, can you add a whatsnew note; enhancements in 1.2
does this have a min pyarrow version?
add whatsnew entry: enhancements in 1.2
|
@jreback Since the enhancements work on pyarrow 0.15.0, we could leave the min pyarrow version as >= 0.15.0 for this pull request. I've added a whatsnew note on enhancements in 1.2 under "Other enhancements" - feel free to let me know if a more detailed whatsnew note is needed. :) P.S. Tests are failing after I did a rebase to update my branch. |
|
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this. |
05a2f62 to
039094c
Compare
|
Test failures look unrelated, looks related to #37818 |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
charlesdong1991
left a comment
There was a problem hiding this comment.
CI is back green, one minor comment on whatsnew
|
@hweecat small comments, can you merge master and ping on green. |
|
thanks @hweecat and @charlesdong1991 |
|
all credits to @hweecat very nice job! |
|
Any reason why this explicitly raises errors for non-string multi-indicies? |
black pandasgit diff upstream/master -u -- "*.py" | flake8 --diffSupport MultiIndex for columns in parquet format by updating value column names check to handle MultiIndexes.