Fix array cast/embed with null values #6283

mariosasko · 2023-10-05T15:24:05Z

Fixes issues with casting/embedding PyArrow list arrays with null values. It also bumps the required PyArrow version to 12.0.0 (over 9 months old) to simplify the implementation.

Fix #6280, fix #6311, fix #6360

(Also fixes #5430 to make Beam compatible with PyArrow>=12.0.0)
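
For context on why null values are tricky here, the following is a minimal pure-Python model of Arrow's variable-size list layout (offsets, validity flags, and a flat child values sequence). It does not use PyArrow, and the names `ListLayout`, `raw_values`, and `flatten` are hypothetical, chosen only for illustration. It shows that a null slot may still span entries in the child values, so reading the raw child values is not the same as reading the logical content.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ListLayout:
    """Toy model of an Arrow-style variable-size list array (illustration only)."""

    offsets: List[int]    # len(offsets) == number of slots + 1
    validity: List[bool]  # False marks a null slot
    values: List[int]     # flat child values shared by all slots

    def to_pylist(self) -> List[Optional[List[int]]]:
        out = []
        for i, valid in enumerate(self.validity):
            start, end = self.offsets[i], self.offsets[i + 1]
            out.append(self.values[start:end] if valid else None)
        return out

    def raw_values(self) -> List[int]:
        # Ignores offsets and validity: returns everything in the child sequence.
        return list(self.values)

    def flatten(self) -> List[int]:
        # Keeps only the child entries that belong to non-null slots.
        out = []
        for i, valid in enumerate(self.validity):
            if valid:
                out.extend(self.values[self.offsets[i]:self.offsets[i + 1]])
        return out


# Slot 1 is null but still spans two child entries (98 and 99).
arr = ListLayout(
    offsets=[0, 2, 4, 6],
    validity=[True, False, True],
    values=[1, 2, 98, 99, 5, 6],
)
print(arr.to_pylist())   # [[1, 2], None, [5, 6]]
print(arr.raw_values())  # [1, 2, 98, 99, 5, 6]
print(arr.flatten())     # [1, 2, 5, 6]
```

Code that processes the raw child values and then reattaches the original offsets has to account for such hidden entries, which is the kind of situation this PR deals with.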

github-actions · 2023-10-05T15:31:37Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006278 / 0.011353 (-0.005075)	0.003692 / 0.011008 (-0.007316)	0.080464 / 0.038508 (0.041956)	0.064751 / 0.023109 (0.041642)	0.318586 / 0.275898 (0.042688)	0.351435 / 0.323480 (0.027955)	0.005044 / 0.007986 (-0.002942)	0.003034 / 0.004328 (-0.001295)	0.063710 / 0.004250 (0.059460)	0.050607 / 0.037052 (0.013555)	0.318491 / 0.258489 (0.060001)	0.365688 / 0.293841 (0.071847)	0.027818 / 0.128546 (-0.100729)	0.008119 / 0.075646 (-0.067527)	0.262141 / 0.419271 (-0.157131)	0.044710 / 0.043533 (0.001177)	0.318875 / 0.255139 (0.063736)	0.344559 / 0.283200 (0.061360)	0.022861 / 0.141683 (-0.118822)	1.452402 / 1.452155 (0.000247)	1.502340 / 1.492716 (0.009624)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.219355 / 0.018006 (0.201349)	0.433311 / 0.000490 (0.432822)	0.006545 / 0.000200 (0.006345)	0.000078 / 0.000054 (0.000024)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.024538 / 0.037411 (-0.012874)	0.073346 / 0.014526 (0.058821)	0.083824 / 0.176557 (-0.092733)	0.145176 / 0.737135 (-0.591959)	0.085941 / 0.296338 (-0.210397)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.395153 / 0.215209 (0.179944)	3.944734 / 2.077655 (1.867080)	1.883910 / 1.504120 (0.379790)	1.690560 / 1.541195 (0.149365)	1.775180 / 1.468490 (0.306690)	0.506873 / 4.584777 (-4.077904)	3.111095 / 3.745712 (-0.634617)	2.915358 / 5.269862 (-2.354504)	1.892886 / 4.565676 (-2.672791)	0.058690 / 0.424275 (-0.365585)	0.006550 / 0.007607 (-0.001057)	0.463372 / 0.226044 (0.237328)	4.640511 / 2.268929 (2.371583)	2.321051 / 55.444624 (-53.123573)	1.986330 / 6.876477 (-4.890147)	2.160046 / 2.142072 (0.017973)	0.597833 / 4.805227 (-4.207394)	0.127946 / 6.500664 (-6.372718)	0.059709 / 0.075469 (-0.015760)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.278966 / 1.841788 (-0.562822)	17.863102 / 8.074308 (9.788794)	13.896057 / 10.191392 (3.704665)	0.147512 / 0.680424 (-0.532912)	0.016771 / 0.534201 (-0.517430)	0.335260 / 0.579283 (-0.244024)	0.383019 / 0.434364 (-0.051345)	0.384821 / 0.540337 (-0.155516)	0.550143 / 1.386936 (-0.836793)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006234 / 0.011353 (-0.005118)	0.003695 / 0.011008 (-0.007313)	0.062654 / 0.038508 (0.024146)	0.059397 / 0.023109 (0.036287)	0.458375 / 0.275898 (0.182477)	0.488951 / 0.323480 (0.165471)	0.004971 / 0.007986 (-0.003014)	0.002914 / 0.004328 (-0.001415)	0.061184 / 0.004250 (0.056934)	0.051246 / 0.037052 (0.014194)	0.458035 / 0.258489 (0.199546)	0.490838 / 0.293841 (0.196997)	0.028746 / 0.128546 (-0.099800)	0.008167 / 0.075646 (-0.067480)	0.068006 / 0.419271 (-0.351265)	0.041809 / 0.043533 (-0.001724)	0.453896 / 0.255139 (0.198757)	0.477583 / 0.283200 (0.194383)	0.020906 / 0.141683 (-0.120777)	1.443275 / 1.452155 (-0.008879)	1.493431 / 1.492716 (0.000714)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.219903 / 0.018006 (0.201896)	0.410275 / 0.000490 (0.409785)	0.003919 / 0.000200 (0.003719)	0.000078 / 0.000054 (0.000024)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.027850 / 0.037411 (-0.009561)	0.080444 / 0.014526 (0.065918)	0.089943 / 0.176557 (-0.086614)	0.145810 / 0.737135 (-0.591326)	0.090908 / 0.296338 (-0.205430)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.464386 / 0.215209 (0.249177)	4.633787 / 2.077655 (2.556133)	2.581658 / 1.504120 (1.077538)	2.408486 / 1.541195 (0.867291)	2.460491 / 1.468490 (0.992001)	0.507512 / 4.584777 (-4.077265)	3.190363 / 3.745712 (-0.555349)	2.895581 / 5.269862 (-2.374280)	1.871506 / 4.565676 (-2.694171)	0.058469 / 0.424275 (-0.365806)	0.006526 / 0.007607 (-0.001082)	0.537641 / 0.226044 (0.311596)	5.396660 / 2.268929 (3.127731)	3.027028 / 55.444624 (-52.417596)	2.703771 / 6.876477 (-4.172705)	2.865576 / 2.142072 (0.723503)	0.600103 / 4.805227 (-4.205124)	0.127109 / 6.500664 (-6.373555)	0.060985 / 0.075469 (-0.014484)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.365030 / 1.841788 (-0.476758)	17.988218 / 8.074308 (9.913909)	14.900796 / 10.191392 (4.709404)	0.158211 / 0.680424 (-0.522213)	0.018291 / 0.534201 (-0.515910)	0.337437 / 0.579283 (-0.241846)	0.383710 / 0.434364 (-0.050654)	0.392341 / 0.540337 (-0.147997)	0.561584 / 1.386936 (-0.825352)
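
Each cell in the benchmark tables above has the form `new / old (diff)`. As a reading aid, here is a small sketch that parses one cell and checks that the reported diff is `new - old`. The helper name `parse_cell` is hypothetical and not part of the benchmark tooling. Some cells differ in the last printed digit, presumably because of rounding, so the check uses a small tolerance.

```python
import re
from typing import Tuple

# Matches cells such as "0.006278 / 0.011353 (-0.005075)".
CELL_RE = re.compile(r"^\s*([\d.]+)\s*/\s*([\d.]+)\s*\(\s*(-?[\d.]+)\s*\)\s*$")


def parse_cell(cell: str) -> Tuple[float, float, float]:
    """Split a 'new / old (diff)' cell into three floats."""
    match = CELL_RE.match(cell)
    if match is None:
        raise ValueError(f"unrecognized benchmark cell: {cell!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


new, old, diff = parse_cell("0.006278 / 0.011353 (-0.005075)")
# The reported diff is new minus old; allow for rounding in the last digit.
assert abs((new - old) - diff) < 2e-6
print(f"{new - old:+.6f}")
```

A negative diff means the new measurement is smaller than the reference value it is compared against.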

mariosasko · 2023-10-05T15:55:10Z

CI failures are unrelated

lhoestq

Thanks for the fix !

mariosasko · 2023-10-06T13:34:12Z

I also plan to address #6280 (comment) in this PR :).

lhoestq · 2023-10-06T13:46:13Z

Oh ok, ping me again whenever you want another review :)

lhoestq · 2023-11-29T11:13:46Z

Have you had a chance to continue this ? I can also take a look if you want

mariosasko · 2023-12-01T17:54:50Z

Yes, I'll finish it next week :).

HuggingFaceDocBuilderDev · 2023-12-21T15:56:54Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

mariosasko · 2023-12-21T20:00:09Z

@lhoestq Feel free to review this again. I've bumped PyArrow to 12.0.0 to simplify the implementation (no need for a custom array_concat and fewer pa.Array.from_buffers calls). However, this makes apache-beam complain, as it only supports PyArrow <12.0.0. The next apache-beam release will raise this bound to <15.0.0, so I think the only solution is to wait for it to be published.
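
The version conflict described above can be illustrated with a small sketch. It assumes plain numeric `X.Y.Z` version strings and combines the new `datasets` lower bound with the apache-beam upper bound mentioned in the comment. The helpers `parse_version` and `satisfies` are hypothetical; real tooling would typically rely on a proper version-specifier library instead.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse a plain 'X.Y.Z' version string into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version: str, minimum: str, upper_exclusive: str) -> bool:
    """Return True if minimum <= version < upper_exclusive."""
    return parse_version(minimum) <= parse_version(version) < parse_version(upper_exclusive)


# Lower bound from this PR (>= 12.0.0) combined with apache-beam's current bound (< 12.0.0).
print(satisfies("12.0.0", minimum="12.0.0", upper_exclusive="12.0.0"))  # False
# Same lower bound combined with the bound announced for the next release (< 15.0.0).
print(satisfies("12.0.0", minimum="12.0.0", upper_exclusive="15.0.0"))  # True
```

With the current bounds no PyArrow version satisfies both packages, which matches the conclusion that waiting for the next apache-beam release is the only option.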

lhoestq

Nice ! Handling extension types can be quite complicated

Btw, if there are PyArrow issues we can link to this PR, feel free to add them. That way we can follow their progress and maybe simplify this code later.

lhoestq · 2024-02-06T10:05:33Z

src/datasets/table.py

+                    if array_type != storage_type:
+                        # Temporarily convert to the storage type to support extension types in the slice operation
+                        array = _c(array, storage_type)
+                        array = pc.list_slice(array, 0, pa_type.list_size, return_fixed_size_list=True)


this may bring the data in memory no ?

maybe it's fine for now though

Yes, pc.list_slice brings the array into memory. Unfortunately, I don't think it's possible to avoid this, as we need to "expand" the null values in the .values array to prepare them for the ListArray -> FixedSizeListArray cast, which requires a memory allocation.

We (usually) run these casts on subtables before writing them to disk, so this solution should be fine for now.

Casting ListArray -> FixedSizeListArray using PyArrow's ListArray.cast also allocates memory if the array contains null lists, so indeed there isn't much we can do about this due to the difference in the null values storage layout.
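
The "expansion" discussed in this thread can be sketched in plain Python. This is a toy model, not the code from `src/datasets/table.py`. It assumes that a fixed-size list layout reserves `list_size` child entries for every slot, including null ones, while a variable-size layout may reserve none for a null slot. `to_fixed_size` and its parameters are hypothetical names.

```python
from typing import List, Optional, Tuple


def to_fixed_size(
    offsets: List[int],
    validity: List[bool],
    values: List[int],
    list_size: int,
) -> Tuple[List[bool], List[Optional[int]]]:
    """Convert a toy variable-size list layout to a toy fixed-size list layout."""
    new_values: List[Optional[int]] = []  # freshly allocated child sequence
    for i, valid in enumerate(validity):
        start, end = offsets[i], offsets[i + 1]
        if not valid:
            # A null slot still needs list_size placeholder entries.
            new_values.extend([None] * list_size)
        elif end - start != list_size:
            raise ValueError(f"slot {i} has length {end - start}, expected {list_size}")
        else:
            new_values.extend(values[start:end])
    return list(validity), new_values


# The null slot (index 1) occupies no child entries in the variable-size layout.
validity, fixed_values = to_fixed_size(
    offsets=[0, 2, 2, 4],
    validity=[True, False, True],
    values=[1, 2, 3, 4],
    list_size=2,
)
print(fixed_values)  # [1, 2, None, None, 3, 4]
```

The output child sequence is longer than the input one, which is why a copy cannot be avoided when null lists are present.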

src/datasets/table.py

Co-authored-by: Quentin Lhoest <[email protected]>

github-actions · 2024-02-06T19:30:24Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005188 / 0.011353 (-0.006165)	0.003997 / 0.011008 (-0.007011)	0.062642 / 0.038508 (0.024134)	0.028913 / 0.023109 (0.005804)	0.248289 / 0.275898 (-0.027609)	0.268084 / 0.323480 (-0.055396)	0.004093 / 0.007986 (-0.003893)	0.002822 / 0.004328 (-0.001506)	0.048263 / 0.004250 (0.044012)	0.041520 / 0.037052 (0.004468)	0.263277 / 0.258489 (0.004788)	0.289835 / 0.293841 (-0.004006)	0.027621 / 0.128546 (-0.100925)	0.010793 / 0.075646 (-0.064853)	0.207624 / 0.419271 (-0.211648)	0.035597 / 0.043533 (-0.007936)	0.245706 / 0.255139 (-0.009433)	0.268157 / 0.283200 (-0.015043)	0.017310 / 0.141683 (-0.124373)	1.130656 / 1.452155 (-0.321499)	1.162134 / 1.492716 (-0.330583)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.094081 / 0.018006 (0.076075)	0.302298 / 0.000490 (0.301809)	0.000220 / 0.000200 (0.000020)	0.000048 / 0.000054 (-0.000006)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.019072 / 0.037411 (-0.018339)	0.061162 / 0.014526 (0.046636)	0.072820 / 0.176557 (-0.103737)	0.122628 / 0.737135 (-0.614507)	0.074962 / 0.296338 (-0.221377)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.277858 / 0.215209 (0.062649)	2.688478 / 2.077655 (0.610823)	1.397366 / 1.504120 (-0.106754)	1.285078 / 1.541195 (-0.256117)	1.291559 / 1.468490 (-0.176931)	0.553646 / 4.584777 (-4.031131)	2.355737 / 3.745712 (-1.389975)	2.773025 / 5.269862 (-2.496836)	1.731195 / 4.565676 (-2.834481)	0.061372 / 0.424275 (-0.362903)	0.004928 / 0.007607 (-0.002679)	0.321703 / 0.226044 (0.095659)	3.212927 / 2.268929 (0.943999)	1.727104 / 55.444624 (-53.717521)	1.479430 / 6.876477 (-5.397047)	1.513436 / 2.142072 (-0.628637)	0.629913 / 4.805227 (-4.175315)	0.114607 / 6.500664 (-6.386057)	0.041707 / 0.075469 (-0.033762)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.976060 / 1.841788 (-0.865727)	11.575163 / 8.074308 (3.500855)	9.521390 / 10.191392 (-0.670003)	0.138725 / 0.680424 (-0.541699)	0.013752 / 0.534201 (-0.520449)	0.286252 / 0.579283 (-0.293031)	0.263420 / 0.434364 (-0.170944)	0.325531 / 0.540337 (-0.214806)	0.419466 / 1.386936 (-0.967470)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005615 / 0.011353 (-0.005738)	0.003884 / 0.011008 (-0.007124)	0.049563 / 0.038508 (0.011055)	0.032573 / 0.023109 (0.009464)	0.276917 / 0.275898 (0.001019)	0.298403 / 0.323480 (-0.025077)	0.004367 / 0.007986 (-0.003618)	0.002794 / 0.004328 (-0.001534)	0.049105 / 0.004250 (0.044855)	0.045597 / 0.037052 (0.008545)	0.289762 / 0.258489 (0.031273)	0.318440 / 0.293841 (0.024599)	0.051883 / 0.128546 (-0.076664)	0.010644 / 0.075646 (-0.065003)	0.057455 / 0.419271 (-0.361816)	0.033667 / 0.043533 (-0.009866)	0.274424 / 0.255139 (0.019285)	0.295890 / 0.283200 (0.012690)	0.017029 / 0.141683 (-0.124654)	1.130123 / 1.452155 (-0.322031)	1.214827 / 1.492716 (-0.277889)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.094882 / 0.018006 (0.076876)	0.302505 / 0.000490 (0.302015)	0.000228 / 0.000200 (0.000028)	0.000052 / 0.000054 (-0.000003)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.021695 / 0.037411 (-0.015716)	0.075196 / 0.014526 (0.060670)	0.086641 / 0.176557 (-0.089915)	0.124893 / 0.737135 (-0.612243)	0.088765 / 0.296338 (-0.207574)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.303388 / 0.215209 (0.088179)	2.934506 / 2.077655 (0.856852)	1.608607 / 1.504120 (0.104487)	1.494632 / 1.541195 (-0.046563)	1.512801 / 1.468490 (0.044310)	0.558563 / 4.584777 (-4.026214)	2.383212 / 3.745712 (-1.362500)	2.634629 / 5.269862 (-2.635233)	1.729319 / 4.565676 (-2.836357)	0.062345 / 0.424275 (-0.361930)	0.004981 / 0.007607 (-0.002626)	0.358333 / 0.226044 (0.132289)	3.484229 / 2.268929 (1.215301)	2.010043 / 55.444624 (-53.434581)	1.693733 / 6.876477 (-5.182744)	1.824150 / 2.142072 (-0.317922)	0.650835 / 4.805227 (-4.154392)	0.115933 / 6.500664 (-6.384732)	0.041270 / 0.075469 (-0.034199)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.007949 / 1.841788 (-0.833838)	12.000085 / 8.074308 (3.925776)	10.453119 / 10.191392 (0.261727)	0.143583 / 0.680424 (-0.536840)	0.015937 / 0.534201 (-0.518264)	0.286653 / 0.579283 (-0.292631)	0.272359 / 0.434364 (-0.162005)	0.330520 / 0.540337 (-0.209818)	0.417015 / 1.386936 (-0.969921)

20141888 · 2024-07-04T07:24:19Z

The problem still occurs.
Hugging Face sucks 🤮🤮🤮🤮

Fix array.values handling in array cast/embed

b7571ab

mariosasko requested a review from lhoestq October 5, 2023 15:24

mariosasko mentioned this pull request Oct 5, 2023

Better list array values handling in cast/embed storage #6213

Closed

Fix fixed size array with nulls cast

feb1c1a

lhoestq approved these changes Oct 6, 2023

mariosasko marked this pull request as draft October 6, 2023 13:33

mariosasko mentioned this pull request Oct 18, 2023

cast_column to Sequence with length=4 occur exception raise in datasets/table.py:2146 #6311

Closed

mariosasko mentioned this pull request Nov 2, 2023

Add support for Sequence(Audio/Image) feature in push_to_hub #6360

Closed

mariosasko added 9 commits December 5, 2023 19:46

Bump PyArrow to version 12.0.0

0b2ad10

Fix cast/embed

fda0d31

Resolve merge conflicts

f7b48ba

Remove pdb comment

024c029

Add warnings and some comments

09b5e15

CI fix

19ee42b

Onemore comment

1505ce6

Merge branch 'main' of github.com:huggingface/datasets into fix-array…

5c8aa27

…_values

Don't install beam

da085d8

mariosasko added 3 commits December 21, 2023 17:20

Fix tests

2aec0f7

Still run beam tests?

12c4c57

Revert "Still run beam tests?"

00e7856

This reverts commit 12c4c57.

mariosasko added 2 commits January 23, 2024 19:21

Nit

9a694d8

Merge branch 'main' of github.com:huggingface/datasets into fix-array…

c167caa

…_values

mariosasko added 6 commits January 26, 2024 16:19

Cleaner implementation

4828edf

Cleaner impl part 2

087140e

Resolve conflict

68faedb

Nit

2881a1a

Fix CI

86c8ac2

Merge branch 'main' of github.com:huggingface/datasets into fix-array…

2a211e1

…_values

mariosasko changed the title ~~Fix array.values handling in array cast/embed~~ Fix array cast/embed with null values Jan 31, 2024

mariosasko marked this pull request as ready for review February 1, 2024 00:00

mariosasko added 3 commits February 1, 2024 19:30

Optimization

79ee0df

Nit

d088db4

Nit

c9343c0

lhoestq approved these changes Feb 6, 2024

Update src/datasets/table.py

3a68113

Co-authored-by: Quentin Lhoest <[email protected]>

mariosasko merged commit ac05bac into main Feb 6, 2024

mariosasko deleted the fix-array_values branch February 6, 2024 19:24

lhoestq mentioned this pull request Feb 9, 2024

Batched dataset map throws exception that cannot cast fixed length array to Sequence #6654

Closed

StevenSong mentioned this pull request Mar 12, 2024

Errror when saving to disk a dataset of images #5717

Open

albertvillanova mentioned this pull request Jul 3, 2024

Fix casting list array to fixed size list #7021

Merged

red-hat-konflux bot mentioned this pull request Sep 6, 2025

Update dependency datasets to v2.21.0 rhoai-rhtap/training-operator#31

Open

5 participants
