REF: move get_filepath_buffer into get_handle#37639
REF: move get_filepath_buffer into get_handle#37639jreback merged 2 commits intopandas-dev:masterfrom twoertwein:io
Conversation
jreback
left a comment
There was a problem hiding this comment.
wow this is a quite a lot of simplification. some questions and a couple requests for followup (don't add to this PR)
pandas/io/parsers.py
Outdated
|
|
||
| if isinstance(src, list): | ||
| self.handles = IOHandles( | ||
| handle=src, compression={"method": None} # type: ignore[arg-type] |
There was a problem hiding this comment.
yeah, that isn't nice (especially from a typing perspective). The parsers seems to cope with objects implementing __next__ (for example lists). I think a solution would be to have a new attribute for the parser (maybe self.iterable : Optional[Iteratable]). If it is None, use self.handles otherwise use the iterable?
There was a problem hiding this comment.
this branch is only used by read_excel and is only used for the PythonParser. I typed src and added comments about this special case.
I will try to fix this mess later this week.
There was a problem hiding this comment.
I think the 'best' solution is to temporarily silence mypy and put a list in IOHandles. PythonParser (the only parser affected by that) had already a check whether the 'file handle' has readline, if not (for the list from read_excel) it treats it as raw data.
edit:
I found a better solution (well-typed and call get_handle after all potential raise(...) calls to avoid leaking file handles)
|
Hello @twoertwein! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-11-13 06:00:27 UTC |
|
I made |
| contains a "b": | ||
| ``df.to_csv(..., mode="wb")`` allows writing a CSV to a file object | ||
| opened binary mode. In most cases, it is not necessary to specify | ||
| ``mode`` as Pandas will auto-detect whether the file object is |
There was a problem hiding this comment.
I haven't yet found an object that needs an explicit mode
| self.ioargs.filepath_or_buffer = BytesIO(contents) # type: ignore[arg-type] | ||
| self.ioargs.should_close = True | ||
| self.path_or_buf = cast(BytesIO, self.ioargs.filepath_or_buffer) | ||
| contents = handles.handle.read() |
There was a problem hiding this comment.
I don't know why that is done. The file handle should always be in binary mode. If it wasn't in binary mode read would return a string which BytesIO would complain about. Maybe there is a performance reason (seeking faster in BytesIO than the provided buffer, just speculating)?
jreback
left a comment
There was a problem hiding this comment.
wow, this is way cleaner that before even. once more merge master and ping on green.
|
thanks @twoertwein very nice! |
|
@twoertwein I have a question about the code you added in this PR. In #31817 I'm adding a new csv reader engine (pyarrow) to which I have to provide class BytesIOWrapper:
"""
Allows the pyarrow engine for read_csv() to read from string buffers
"""
def __init__(self, string_buffer: StringIO, encoding: str = "utf-8"):
self.string_buffer = string_buffer
self.encoding = encoding
def __getattr__(self, attr: str):
return getattr(self.string_buffer, attr)
def read(self, size: int = -1):
content = self.string_buffer.read(size)
return content.encode(self.encoding)but I'm wondering if this can be done using Thanks so much! |
|
Yes, Towards the end of I assume that the following code might get you quite far if not (is_text or _is_binary_mode(handle, mode)):
handle = BytesIOWrapper(handle, encoding)
mode = mode.replace("b", "") |
|
@arw2019 I think that the About adding the |
|
@twoertwein Thank you so much for these responses!!! I'm thinking I'll do a separate PR to add the |
…straint due to renaming of `get_filepath_or_buffer` pandas-dev/pandas#37639 pandas-dev/pandas@6d1541e#diff-934d8564d648e7521db673c6399dcac98e45adfd5230ba47d3aabfcc21979febL247 PEtab-dev/PEtab#493
…fer` private without major release (semver!?) pandas-dev/pandas#37639 pandas-dev/pandas@6d1541e#diff-934d8564d648e7521db673c6399dcac98e45adfd5230ba47d3aabfcc21979febL247
Fixes pandas-dev#48700 Refs pandas-dev#9245 Refs pandas-dev#37639 Regressed in 6d1541e
Fixes pandas-dev#48700 Refs pandas-dev#9245 Refs pandas-dev#37639 Regressed in 6d1541e
Fixes pandas-dev#48700 Refs pandas-dev#9245 Refs pandas-dev#37639 Regressed in 6d1541e
Fixes pandas-dev#48700 Refs pandas-dev#9245 Refs pandas-dev#37639 Regressed in 6d1541e
Fixes pandas-dev#48700 Refs pandas-dev#9245 Refs pandas-dev#37639 Regressed in 6d1541e
Fixes pandas-dev#48700 Refs pandas-dev#9245 Refs pandas-dev#37639 Regressed in 6d1541e
Fixes pandas-dev#48700 Refs pandas-dev#9245 Refs pandas-dev#37639 Regressed in 6d1541e
Fixes pandas-dev#48700 Refs pandas-dev#9245 Refs pandas-dev#37639 Regressed in 6d1541e
black pandasgit diff upstream/master -u -- "*.py" | flake8 --diffThis PR makes
get_filepath_buffera private function and is called insideget_handle- one function to rule all of IO ;)This PR will make it easier for future PRs to make the IO-related interface of
read/to_*more consistent as most of them should support compression/memory mapping (for reading)/(binary) file handles/storage options.Notes to keep track of future follow-up PRs:
get_handlestorage_optionsforto_excel