Create ExtractManager #2295
lhoestq
left a comment
This is so clean ! Thanks :)
src/datasets/utils/extract.py
Outdated
    if self._do_extract(output_path, force_extract):
        try:
            self.extractor.extract(input_path, output_path, extractor=extractor)
        except Exception:
Maybe catch only the exception raised when the archive format is not identified? Which errors would you like to catch here?
Otherwise this could catch unexpected errors.
You can also add the message of the original error to the EnvironmentError as additional information.
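A minimal sketch of the suggestion above, not the code from this PR: catch only a dedicated error for unidentified archive formats, and chain the original error into the EnvironmentError. The names `UnknownArchiveFormatError`, `extract_archive`, and `safe_extract` are hypothetical and exist only for illustration.

```python
import tarfile
import zipfile


class UnknownArchiveFormatError(Exception):
    """Hypothetical error: raised when no extractor recognizes the file."""


def extract_archive(input_path, output_path):
    # Hypothetical extractor that only understands tar and zip archives.
    if tarfile.is_tarfile(input_path):
        with tarfile.open(input_path) as tar_file:
            tar_file.extractall(output_path)
    elif zipfile.is_zipfile(input_path):
        with zipfile.ZipFile(input_path, "r") as zip_file:
            zip_file.extractall(output_path)
    else:
        raise UnknownArchiveFormatError(f"no extractor matched {input_path}")


def safe_extract(input_path, output_path):
    try:
        extract_archive(input_path, output_path)
    except UnknownArchiveFormatError as err:
        # Only the "format not identified" case is translated; any other
        # error propagates unchanged. "from err" keeps the original message.
        raise EnvironmentError(
            f"Archive format of {input_path} could not be identified: {err}"
        ) from err
    return output_path
```

With this shape, an unexpected failure (for example a permission error while writing) is not masked by the generic "could not be identified" message.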
You are right: in the original code there was no exception catch.
Indeed, in the original code the raise of EnvironmentError was dead code:
- previously, if none of the if conditions was met, the function returned early (return output_path). See:
datasets/src/datasets/utils/file_utils.py
Lines 305 to 313 in d7a7223
    if (
        not is_zipfile(output_path)
        and not tarfile.is_tarfile(output_path)
        and not is_gzip(output_path)
        and not is_xz(output_path)
        and not is_rarfile(output_path)
        and not ZstdExtractor.is_extractable(output_path)
    ):
        return output_path
- and the raise sat in the else clause at the end of the same if conditions, repeated again. Because of the early return mentioned above, that else clause could never be executed. See:
datasets/src/datasets/utils/file_utils.py
Lines 338 to 369 in d7a7223
    if tarfile.is_tarfile(output_path):
        tar_file = tarfile.open(output_path)
        tar_file.extractall(output_path_extracted)
        tar_file.close()
    elif is_gzip(output_path):
        os.rmdir(output_path_extracted)
        with gzip.open(output_path, "rb") as gzip_file:
            with open(output_path_extracted, "wb") as extracted_file:
                shutil.copyfileobj(gzip_file, extracted_file)
    elif is_zipfile(output_path):  # put zip file to the last, b/c it is possible wrongly detected as zip
        with ZipFile(output_path, "r") as zip_file:
            zip_file.extractall(output_path_extracted)
            zip_file.close()
    elif is_xz(output_path):
        os.rmdir(output_path_extracted)
        with lzma.open(output_path) as compressed_file:
            with open(output_path_extracted, "wb") as extracted_file:
                shutil.copyfileobj(compressed_file, extracted_file)
    elif is_rarfile(output_path):
        if config.RARFILE_AVAILABLE:
            import rarfile
            rf = rarfile.RarFile(output_path)
            rf.extractall(output_path_extracted)
            rf.close()
        else:
            raise EnvironmentError("Please pip install rarfile")
    elif ZstdExtractor.is_extractable(output_path):
        os.rmdir(output_path_extracted)
        ZstdExtractor.extract(output_path, output_path_extracted)
    else:
        raise EnvironmentError("Archive format of {} could not be identified".format(output_path))
I would suggest simply removing this raise, together with the corresponding try and except.
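The dead-code argument can be illustrated with a reduced sketch. It assumes the format checks are deterministic between calls; `is_a`, `is_b`, and `extract` are hypothetical stand-ins for the real checks (tarfile.is_tarfile, is_gzip, and so on) and for the original function, not the code from file_utils.py.

```python
def is_a(path):
    # Hypothetical format check, standing in for e.g. tarfile.is_tarfile.
    return path.endswith(".a")


def is_b(path):
    # Hypothetical format check, standing in for e.g. is_gzip.
    return path.endswith(".b")


def extract(path):
    # Early guard: if no format check matches, return the path untouched.
    if not is_a(path) and not is_b(path):
        return path
    # Past the guard, at least one check is true, so one of the first
    # two branches is always taken and the final else can never run.
    if is_a(path):
        return path + ".extracted_a"
    elif is_b(path):
        return path + ".extracted_b"
    else:
        raise EnvironmentError(f"Archive format of {path} could not be identified")
```

An unrecognized file takes the early return instead of raising, which is why removing the raise does not change behavior.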
I think all is done @lhoestq ;)
lhoestq
left a comment
Nice thanks !
Perform refactoring to decouple extract functionality.