
Conversation

@bhavitvyamalik
Contributor

Fixes #625. This lets the user preserve the dtype of a numpy array when converting it to a pyarrow array; previously the dtype was lost because of the numpy array -> list -> pyarrow array conversion.
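The dtype loss described here can be reproduced with plain numpy: round-tripping a float32 array through a Python list discards the 32-bit precision, since the list elements become plain Python floats (a minimal sketch; pyarrow's type inference behaves the same way when handed a list instead of the original array).

```python
import numpy as np

# An array with an explicit 32-bit dtype.
arr = np.array([1.0, 2.0, 3.0], dtype=np.float32)

# tolist() yields plain Python floats, so the 32-bit dtype is gone;
# rebuilding an array from that list infers the default float64.
roundtripped = np.array(arr.tolist())

print(arr.dtype)           # float32
print(roundtripped.dtype)  # float64
```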

@bhavitvyamalik
Contributor Author

Hi @lhoestq,
It turns out that pyarrow ListArray objects are not recognized as list-like when we get output from numpy_to_pyarrow_listarray. This might cause tests to fail. If possible, can we convert that ListArray output to a list in order for the tests to pass? Under the hood it will still maintain the dtype of the numpy array passed as input.
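The list-likeness problem can be sketched without pyarrow: a pa.ListArray supports indexing and length but is not a Python `list`, so an `isinstance` check in a test helper would reject it. The stand-in class below is hypothetical (pyarrow itself is not imported here); `to_pylist()` mirrors the real pyarrow method that performs the conversion suggested above.

```python
# Hypothetical stand-in for pa.ListArray: sequence-like, but not a list.
class FakeListArray:
    def __init__(self, values):
        self._values = list(values)

    def __len__(self):
        return len(self._values)

    def __getitem__(self, i):
        return self._values[i]

    def to_pylist(self):
        # Real pyarrow arrays expose to_pylist() to get plain Python lists.
        return list(self._values)


out = FakeListArray([1.0, 2.0])

print(isinstance(out, list))              # False -> rejected by a list check
print(isinstance(out.to_pylist(), list))  # True after conversion
```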

Member

@lhoestq lhoestq left a comment

Thanks! To fix this I added a comment to keep the numpy array unchanged until it is passed to a TypedSequence. This way we don't have to deal with the ListArray issue.

@bhavitvyamalik
Contributor Author

bhavitvyamalik commented May 19, 2021

Brought the failing tests down from 7 to 4. Let me know if that part looks good. The failing tests look quite similar. In test_map_torch

Features({"filename": Value("string"), "tensor": Sequence(Value("float64"))}),

and in test_map_tf

Features({"filename": Value("string"), "tensor": Sequence(Value("float64"))}),

they're expecting float64. Shouldn't that be float32 now?

@bhavitvyamalik bhavitvyamalik requested a review from lhoestq May 19, 2021 09:04
@lhoestq
Member

lhoestq commented May 19, 2021

It's normal: pytorch and tensorflow use float32 by default, unlike numpy which uses float64.

I think we should always keep the precision of the original tensor (torch/tf/numpy).
That means the PR is fine as it is (the precision is preserved when doing the torch/tf -> numpy conversion).

This is a breaking change but in my opinion the fact that we had Value("float64") for torch.float32 tensors was an issue already.

Let me know what you think. Cc @albertvillanova if you have an opinion on this

If we agree on doing this breaking change, we can just change the test.
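The default-precision mismatch behind those float64 expectations can be seen with numpy alone (torch/tf appear only in comments here, since importing those frameworks in a small sketch is an assumption):

```python
import numpy as np

# numpy infers 64-bit floats for Python float inputs.
x = np.array([1.0, 2.0])
print(x.dtype)  # float64

# By contrast, torch.tensor([1.0, 2.0]) and tf.constant([1.0, 2.0])
# default to float32, so a float32 tensor whose dtype is preserved
# through the numpy conversion should map to Value("float32"),
# not Value("float64").
x32 = np.asarray(x, dtype=np.float32)
print(x32.dtype)  # float32
```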

@bhavitvyamalik
Contributor Author

Hi @lhoestq,
Merged master into this branch. Only updating the tests (mentioned below) is left, after which all tests should pass.

> Brought down the failing tests from 7 to 4. Let me know if that part looks good. Failing tests are looking quite similar. In test_map_torch
>
> Features({"filename": Value("string"), "tensor": Sequence(Value("float64"))}),
>
> and test_map_tf
>
> Features({"filename": Value("string"), "tensor": Sequence(Value("float64"))}),
>
> they're expecting float64. Shouldn't that be float32 now?

@lhoestq
Member

lhoestq commented Jul 29, 2021

> they're expecting float64. Shouldn't that be float32 now?

Yes feel free to update those tests :)

It would be nice to have the same test for JAX as well

@bhavitvyamalik
Contributor Author

Added the same test for JAX too. Also, I saw that I had missed changing test_cast_to_python_objects_jax like I did for TF and PyTorch. Finished that as well.

Member

@lhoestq lhoestq left a comment

It looks all good !
Thanks a lot :)

@lhoestq lhoestq changed the title preserve dtype for numpy arrays Preserve dtype for numpy/torch/tf/jax arrays Aug 17, 2021
@lhoestq lhoestq merged commit 665bac8 into huggingface:master Aug 17, 2021
JayantGoel001 added a commit to JayantGoel001/datasets-1 that referenced this pull request Aug 17, 2021
Preserve dtype for numpy/torch/tf/jax arrays (huggingface#2361)

Development

Successfully merging this pull request may close these issues.

dtype of tensors should be preserved

2 participants