
Conversation

@bhavitvyamalik
Contributor

Fixes #625. This lets the user preserve the dtype of a numpy array when converting it to a pyarrow array; previously the dtype was lost because of the numpy array -> list -> pyarrow array conversion.
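The dtype loss described here can be reproduced with plain numpy: round-tripping a float32 array through a Python list discards the 32-bit precision, since the list elements become plain Python floats (a minimal sketch; pyarrow's type inference behaves the same way when handed a list instead of the original array).

```python
import numpy as np

# An array with an explicit 32-bit dtype.
arr = np.array([1.0, 2.0, 3.0], dtype=np.float32)

# tolist() yields plain Python floats, so the 32-bit dtype is gone;
# rebuilding an array from that list infers the default float64.
roundtripped = np.array(arr.tolist())

print(arr.dtype)           # float32
print(roundtripped.dtype)  # float64
```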

@bhavitvyamalik
Contributor Author

Hi @lhoestq,
It turns out that pyarrow ListArray objects are not recognized as list-like when we get output from numpy_to_pyarrow_listarray. This might cause tests to fail. If possible, can we convert that ListArray output to a list in order for the tests to pass? Under the hood it will still maintain the dtype of the numpy array passed as input.
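The list-likeness problem can be sketched without pyarrow: a pa.ListArray supports indexing and length but is not a Python `list`, so an `isinstance` check in a test helper would reject it. The stand-in class below is hypothetical (pyarrow itself is not imported here); `to_pylist()` mirrors the real pyarrow method that performs the conversion suggested above.

```python
# Hypothetical stand-in for pa.ListArray: sequence-like, but not a list.
class FakeListArray:
    def __init__(self, values):
        self._values = list(values)

    def __len__(self):
        return len(self._values)

    def __getitem__(self, i):
        return self._values[i]

    def to_pylist(self):
        # Real pyarrow arrays expose to_pylist() to get plain Python lists.
        return list(self._values)


out = FakeListArray([1.0, 2.0])

print(isinstance(out, list))              # False -> rejected by a list check
print(isinstance(out.to_pylist(), list))  # True after conversion
```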

Member

@lhoestq lhoestq left a comment

Thanks! To fix this I added a comment to keep the numpy array unchanged until it is passed to a TypedSequence. This way we don't have to deal with the ListArray issue.

@bhavitvyamalik
Contributor Author

bhavitvyamalik commented May 19, 2021

Brought the failing tests down from 7 to 4. Let me know if that part looks good. The failing tests look quite similar. In test_map_torch

Features({"filename": Value("string"), "tensor": Sequence(Value("float64"))}),

and in test_map_tf

Features({"filename": Value("string"), "tensor": Sequence(Value("float64"))}),

they're expecting float64. Shouldn't that be float32 now?

@bhavitvyamalik bhavitvyamalik requested a review from lhoestq May 19, 2021 09:04
@lhoestq
Member

lhoestq commented May 19, 2021

It's normal: pytorch and tensorflow use float32 by default, unlike numpy which uses float64.

I think we should always keep the precision of the original tensor (torch/tf/numpy).
That means the PR is fine as it is (the precision is preserved when doing the torch/tf -> numpy conversion).

This is a breaking change but in my opinion the fact that we had Value("float64") for torch.float32 tensors was an issue already.

Let me know what you think. Cc @albertvillanova if you have an opinion on this

If we agree on doing this breaking change, we can just change the test.
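The default-precision mismatch behind those float64 expectations can be seen with numpy alone (torch/tf appear only in comments here, since importing those frameworks in a small sketch is an assumption):

```python
import numpy as np

# numpy infers 64-bit floats for Python float inputs.
x = np.array([1.0, 2.0])
print(x.dtype)  # float64

# By contrast, torch.tensor([1.0, 2.0]) and tf.constant([1.0, 2.0])
# default to float32, so a float32 tensor whose dtype is preserved
# through the numpy conversion should map to Value("float32"),
# not Value("float64").
x32 = np.asarray(x, dtype=np.float32)
print(x32.dtype)  # float32
```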

@bhavitvyamalik
Contributor Author

Hi @lhoestq,
Merged master into this branch. Only updating the tests (mentioned below) is left, after which all tests should pass.

> Brought down the failing tests from 7 to 4. Let me know if that part looks good. Failing tests are looking quite similar. In test_map_torch
>
> Features({"filename": Value("string"), "tensor": Sequence(Value("float64"))}),
>
> and test_map_tf
>
> Features({"filename": Value("string"), "tensor": Sequence(Value("float64"))}),
>
> they're expecting float64. Shouldn't that be float32 now?

@lhoestq
Member

lhoestq commented Jul 29, 2021

> they're expecting float64. Shouldn't that be float32 now?

Yes feel free to update those tests :)

It would be nice to have the same test for JAX as well

@bhavitvyamalik
Contributor Author

Added the same test for JAX too. Also, I saw that I had missed changing test_cast_to_python_objects_jax like I did for TF and PyTorch. Finished that as well.

Member

@lhoestq lhoestq left a comment

It looks all good !
Thanks a lot :)

@lhoestq lhoestq changed the title preserve dtype for numpy arrays Preserve dtype for numpy/torch/tf/jax arrays Aug 17, 2021
@lhoestq lhoestq merged commit 665bac8 into huggingface:master Aug 17, 2021
JayantGoel001 added a commit to JayantGoel001/datasets-1 that referenced this pull request Aug 17, 2021
Preserve dtype for numpy/torch/tf/jax arrays (huggingface#2361)

Development

Successfully merging this pull request may close these issues.

dtype of tensors should be preserved

2 participants