Skip to content

Conversation

@mariosasko
Copy link
Collaborator

@mariosasko mariosasko commented Mar 9, 2022

This is an attempt to make the user-facing datasets' submodule namespace cleaner:

In particular, this PR does the following:

  • removes the unused zip_nested and flatten_nest_dict and their accompanying tests
  • removes pyarrow from the top-level namespace
  • properly uses __all__ and the from <module> import * syntax to avoid importing the <module>'s submodules
  • cleans up the utils namespace
  • moves the temp_seed context manage from datasets/utils/file_utils.py to datasets/utils/py_utils.py

@severo
Copy link
Collaborator

severo commented Mar 9, 2022

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@mariosasko
Copy link
Collaborator Author

@severo No, this PR doesn't fix that issue in the current state. We can fix it by adding __all__ to datasets/__init__.py and datasets/formatting/__init__.py. However, this would require updating __all__ for each new function/class definition, which could become cumbersome, and we can't do this dynamically because mypy is a static type checker.

@lhoestq @albertvillanova WDYT?

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you ! This is much cleaner

# Add vectors
logger.info(f"Adding {len(vectors)} vectors to the faiss index")
for i in utils.tqdm(range(0, len(vectors), batch_size), disable=not utils.is_progress_bar_enabled()):
for i in utils.tqdm_utils.tqdm(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine if we have tqdm directly in utils with disable_progress_bar

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it okay if I address this in a separate PR before the release (I'll make disable_progress_bar public again + add enable_progress_bar, enable_caching and disable_caching) to close #3586?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure !

Comment on lines -34 to -35
from .arrow_reader import ArrowReader, ReadInstruction
from .arrow_writer import ArrowWriter
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here ArrowWriter and ArrowReader are removed from the top level module. I've seen that the lxmert example in transformers needs datasets.ArrowWriter, so I'm not sure if we should remove them

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are removed because I see them as our internals. I'll open a PR in Transformers to update their paths (used here and here).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok :)

@lhoestq
Copy link
Member

lhoestq commented Mar 11, 2022

Feel free to merge this one if it's good for you :)

@mariosasko mariosasko merged commit ba4d30c into master Mar 11, 2022
@mariosasko mariosasko deleted the cleaner-imports branch March 11, 2022 15:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants