@mariosasko
Collaborator

@mariosasko mariosasko commented Oct 5, 2023

Fixes issues with casting/embedding PyArrow list arrays with null values. It also bumps the required PyArrow version to 12.0.0 (over 9 months old) to simplify the implementation.

Fix #6280, fix #6311, fix #6360

(Also fixes #5430 to make Beam compatible with PyArrow>=12.0.0)
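
To make the failure mode concrete, here is a minimal pure-Python sketch of how a variable-size list array stores null rows. It models an Arrow-style layout (offsets, validity, flat values) with plain lists and does not call PyArrow; the names `ListLayout`, `build`, and `naive_fixed_size_chunks` are hypothetical and exist only for illustration. It assumes a null row contributes no child values, which is the usual case.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ListLayout:
    """Toy model of an Arrow-style variable-size list array (not PyArrow itself)."""
    offsets: List[int]    # row i spans values[offsets[i]:offsets[i + 1]]
    validity: List[bool]  # False marks a null row
    values: List[int]     # flat child values


def build(rows: List[Optional[List[int]]]) -> ListLayout:
    """Encode Python rows; a null row contributes no child values."""
    offsets, validity, values = [0], [], []
    for row in rows:
        validity.append(row is not None)
        values.extend(row or [])
        offsets.append(len(values))
    return ListLayout(offsets, validity, values)


def naive_fixed_size_chunks(layout: ListLayout, list_size: int) -> List[List[int]]:
    """Reinterpret the flat values as fixed-size rows, ignoring nulls (the buggy approach)."""
    v = layout.values
    return [v[i:i + list_size] for i in range(0, len(v), list_size)]


layout = build([[1, 2], None, [3, 4]])
chunks = naive_fixed_size_chunks(layout, list_size=2)
print(layout.offsets)                     # [0, 2, 2, 4]
print(len(layout.validity), len(chunks))  # 3 2
```

The array has three rows, but chunking the flat values yields only two fixed-size rows, so any code that works on the values alone loses track of where the null row sits.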

@github-actions

github-actions bot commented Oct 5, 2023

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

| metric | new | old | diff |
| --- | --- | --- | --- |
| read_batch_formatted_as_numpy after write_array2d | 0.006278 | 0.011353 | -0.005075 |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003692 | 0.011008 | -0.007316 |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.080464 | 0.038508 | 0.041956 |
| read_batch_unformated after write_array2d | 0.064751 | 0.023109 | 0.041642 |
| read_batch_unformated after write_flattened_sequence | 0.318586 | 0.275898 | 0.042688 |
| read_batch_unformated after write_nested_sequence | 0.351435 | 0.323480 | 0.027955 |
| read_col_formatted_as_numpy after write_array2d | 0.005044 | 0.007986 | -0.002942 |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003034 | 0.004328 | -0.001295 |
| read_col_formatted_as_numpy after write_nested_sequence | 0.063710 | 0.004250 | 0.059460 |
| read_col_unformated after write_array2d | 0.050607 | 0.037052 | 0.013555 |
| read_col_unformated after write_flattened_sequence | 0.318491 | 0.258489 | 0.060001 |
| read_col_unformated after write_nested_sequence | 0.365688 | 0.293841 | 0.071847 |
| read_formatted_as_numpy after write_array2d | 0.027818 | 0.128546 | -0.100729 |
| read_formatted_as_numpy after write_flattened_sequence | 0.008119 | 0.075646 | -0.067527 |
| read_formatted_as_numpy after write_nested_sequence | 0.262141 | 0.419271 | -0.157131 |
| read_unformated after write_array2d | 0.044710 | 0.043533 | 0.001177 |
| read_unformated after write_flattened_sequence | 0.318875 | 0.255139 | 0.063736 |
| read_unformated after write_nested_sequence | 0.344559 | 0.283200 | 0.061360 |
| write_array2d | 0.022861 | 0.141683 | -0.118822 |
| write_flattened_sequence | 1.452402 | 1.452155 | 0.000247 |
| write_nested_sequence | 1.502340 | 1.492716 | 0.009624 |

Benchmark: benchmark_getitem_100B.json

| metric | new | old | diff |
| --- | --- | --- | --- |
| get_batch_of_1024_random_rows | 0.219355 | 0.018006 | 0.201349 |
| get_batch_of_1024_rows | 0.433311 | 0.000490 | 0.432822 |
| get_first_row | 0.006545 | 0.000200 | 0.006345 |
| get_last_row | 0.000078 | 0.000054 | 0.000024 |

Benchmark: benchmark_indices_mapping.json

| metric | new | old | diff |
| --- | --- | --- | --- |
| select | 0.024538 | 0.037411 | -0.012874 |
| shard | 0.073346 | 0.014526 | 0.058821 |
| shuffle | 0.083824 | 0.176557 | -0.092733 |
| sort | 0.145176 | 0.737135 | -0.591959 |
| train_test_split | 0.085941 | 0.296338 | -0.210397 |

Benchmark: benchmark_iterating.json

| metric | new | old | diff |
| --- | --- | --- | --- |
| read 5000 | 0.395153 | 0.215209 | 0.179944 |
| read 50000 | 3.944734 | 2.077655 | 1.867080 |
| read_batch 50000 10 | 1.883910 | 1.504120 | 0.379790 |
| read_batch 50000 100 | 1.690560 | 1.541195 | 0.149365 |
| read_batch 50000 1000 | 1.775180 | 1.468490 | 0.306690 |
| read_formatted numpy 5000 | 0.506873 | 4.584777 | -4.077904 |
| read_formatted pandas 5000 | 3.111095 | 3.745712 | -0.634617 |
| read_formatted tensorflow 5000 | 2.915358 | 5.269862 | -2.354504 |
| read_formatted torch 5000 | 1.892886 | 4.565676 | -2.672791 |
| read_formatted_batch numpy 5000 10 | 0.058690 | 0.424275 | -0.365585 |
| read_formatted_batch numpy 5000 1000 | 0.006550 | 0.007607 | -0.001057 |
| shuffled read 5000 | 0.463372 | 0.226044 | 0.237328 |
| shuffled read 50000 | 4.640511 | 2.268929 | 2.371583 |
| shuffled read_batch 50000 10 | 2.321051 | 55.444624 | -53.123573 |
| shuffled read_batch 50000 100 | 1.986330 | 6.876477 | -4.890147 |
| shuffled read_batch 50000 1000 | 2.160046 | 2.142072 | 0.017973 |
| shuffled read_formatted numpy 5000 | 0.597833 | 4.805227 | -4.207394 |
| shuffled read_formatted_batch numpy 5000 10 | 0.127946 | 6.500664 | -6.372718 |
| shuffled read_formatted_batch numpy 5000 1000 | 0.059709 | 0.075469 | -0.015760 |

Benchmark: benchmark_map_filter.json

| metric | new | old | diff |
| --- | --- | --- | --- |
| filter | 1.278966 | 1.841788 | -0.562822 |
| map fast-tokenizer batched | 17.863102 | 8.074308 | 9.788794 |
| map identity | 13.896057 | 10.191392 | 3.704665 |
| map identity batched | 0.147512 | 0.680424 | -0.532912 |
| map no-op batched | 0.016771 | 0.534201 | -0.517430 |
| map no-op batched numpy | 0.335260 | 0.579283 | -0.244024 |
| map no-op batched pandas | 0.383019 | 0.434364 | -0.051345 |
| map no-op batched pytorch | 0.384821 | 0.540337 | -0.155516 |
| map no-op batched tensorflow | 0.550143 | 1.386936 | -0.836793 |
PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new | old | diff |
| --- | --- | --- | --- |
| read_batch_formatted_as_numpy after write_array2d | 0.006234 | 0.011353 | -0.005118 |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003695 | 0.011008 | -0.007313 |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.062654 | 0.038508 | 0.024146 |
| read_batch_unformated after write_array2d | 0.059397 | 0.023109 | 0.036287 |
| read_batch_unformated after write_flattened_sequence | 0.458375 | 0.275898 | 0.182477 |
| read_batch_unformated after write_nested_sequence | 0.488951 | 0.323480 | 0.165471 |
| read_col_formatted_as_numpy after write_array2d | 0.004971 | 0.007986 | -0.003014 |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.002914 | 0.004328 | -0.001415 |
| read_col_formatted_as_numpy after write_nested_sequence | 0.061184 | 0.004250 | 0.056934 |
| read_col_unformated after write_array2d | 0.051246 | 0.037052 | 0.014194 |
| read_col_unformated after write_flattened_sequence | 0.458035 | 0.258489 | 0.199546 |
| read_col_unformated after write_nested_sequence | 0.490838 | 0.293841 | 0.196997 |
| read_formatted_as_numpy after write_array2d | 0.028746 | 0.128546 | -0.099800 |
| read_formatted_as_numpy after write_flattened_sequence | 0.008167 | 0.075646 | -0.067480 |
| read_formatted_as_numpy after write_nested_sequence | 0.068006 | 0.419271 | -0.351265 |
| read_unformated after write_array2d | 0.041809 | 0.043533 | -0.001724 |
| read_unformated after write_flattened_sequence | 0.453896 | 0.255139 | 0.198757 |
| read_unformated after write_nested_sequence | 0.477583 | 0.283200 | 0.194383 |
| write_array2d | 0.020906 | 0.141683 | -0.120777 |
| write_flattened_sequence | 1.443275 | 1.452155 | -0.008879 |
| write_nested_sequence | 1.493431 | 1.492716 | 0.000714 |

Benchmark: benchmark_getitem_100B.json

| metric | new | old | diff |
| --- | --- | --- | --- |
| get_batch_of_1024_random_rows | 0.219903 | 0.018006 | 0.201896 |
| get_batch_of_1024_rows | 0.410275 | 0.000490 | 0.409785 |
| get_first_row | 0.003919 | 0.000200 | 0.003719 |
| get_last_row | 0.000078 | 0.000054 | 0.000024 |

Benchmark: benchmark_indices_mapping.json

| metric | new | old | diff |
| --- | --- | --- | --- |
| select | 0.027850 | 0.037411 | -0.009561 |
| shard | 0.080444 | 0.014526 | 0.065918 |
| shuffle | 0.089943 | 0.176557 | -0.086614 |
| sort | 0.145810 | 0.737135 | -0.591326 |
| train_test_split | 0.090908 | 0.296338 | -0.205430 |

Benchmark: benchmark_iterating.json

| metric | new | old | diff |
| --- | --- | --- | --- |
| read 5000 | 0.464386 | 0.215209 | 0.249177 |
| read 50000 | 4.633787 | 2.077655 | 2.556133 |
| read_batch 50000 10 | 2.581658 | 1.504120 | 1.077538 |
| read_batch 50000 100 | 2.408486 | 1.541195 | 0.867291 |
| read_batch 50000 1000 | 2.460491 | 1.468490 | 0.992001 |
| read_formatted numpy 5000 | 0.507512 | 4.584777 | -4.077265 |
| read_formatted pandas 5000 | 3.190363 | 3.745712 | -0.555349 |
| read_formatted tensorflow 5000 | 2.895581 | 5.269862 | -2.374280 |
| read_formatted torch 5000 | 1.871506 | 4.565676 | -2.694171 |
| read_formatted_batch numpy 5000 10 | 0.058469 | 0.424275 | -0.365806 |
| read_formatted_batch numpy 5000 1000 | 0.006526 | 0.007607 | -0.001082 |
| shuffled read 5000 | 0.537641 | 0.226044 | 0.311596 |
| shuffled read 50000 | 5.396660 | 2.268929 | 3.127731 |
| shuffled read_batch 50000 10 | 3.027028 | 55.444624 | -52.417596 |
| shuffled read_batch 50000 100 | 2.703771 | 6.876477 | -4.172705 |
| shuffled read_batch 50000 1000 | 2.865576 | 2.142072 | 0.723503 |
| shuffled read_formatted numpy 5000 | 0.600103 | 4.805227 | -4.205124 |
| shuffled read_formatted_batch numpy 5000 10 | 0.127109 | 6.500664 | -6.373555 |
| shuffled read_formatted_batch numpy 5000 1000 | 0.060985 | 0.075469 | -0.014484 |

Benchmark: benchmark_map_filter.json

| metric | new | old | diff |
| --- | --- | --- | --- |
| filter | 1.365030 | 1.841788 | -0.476758 |
| map fast-tokenizer batched | 17.988218 | 8.074308 | 9.913909 |
| map identity | 14.900796 | 10.191392 | 4.709404 |
| map identity batched | 0.158211 | 0.680424 | -0.522213 |
| map no-op batched | 0.018291 | 0.534201 | -0.515910 |
| map no-op batched numpy | 0.337437 | 0.579283 | -0.241846 |
| map no-op batched pandas | 0.383710 | 0.434364 | -0.050654 |
| map no-op batched pytorch | 0.392341 | 0.540337 | -0.147997 |
| map no-op batched tensorflow | 0.561584 | 1.386936 | -0.825352 |

@mariosasko
Collaborator Author

CI failures are unrelated

Member

@lhoestq lhoestq left a comment

Thanks for the fix !

@mariosasko mariosasko marked this pull request as draft October 6, 2023 13:33
@mariosasko
Collaborator Author

I also plan to address #6280 (comment) in this PR :).

@lhoestq
Member

lhoestq commented Oct 6, 2023

Oh ok, ping me again whenever you want another review :)

@lhoestq
Member

lhoestq commented Nov 29, 2023

Have you had a chance to continue this? I can also take a look if you want.

@mariosasko
Collaborator Author

Yes, I'll finish it next week :).

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@mariosasko
Collaborator Author

@lhoestq Feel free to review this again. I've bumped PyArrow to 12.0.0 to simplify the implementation (no need for a custom array_concat and fewer pa.Array.from_buffers calls). However, this makes apache-beam complain, as it only supports PyArrow <12.0.0. The next apache-beam release will raise this bound to <15.0.0, so I think the only solution is to wait for it to be published.
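
The version conflict described above can be sketched as a simple range check. This is only an illustration using the bounds quoted in the comment; `parse` and `compatible` are hypothetical helpers, and a real project would more likely rely on a dedicated library such as `packaging` than on tuple comparison.

```python
from typing import Tuple

Version = Tuple[int, int, int]


def parse(version: str) -> Version:
    """Parse a plain 'X.Y.Z' version string (pre-release tags are not handled)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)


def compatible(lower_inclusive: str, upper_exclusive: str) -> bool:
    """True if some version v satisfies lower_inclusive <= v < upper_exclusive."""
    return parse(lower_inclusive) < parse(upper_exclusive)


# Bounds taken from the comment above: this PR needs pyarrow>=12.0.0.
print(compatible("12.0.0", "12.0.0"))  # current apache-beam bound (<12.0.0): False
print(compatible("12.0.0", "15.0.0"))  # announced apache-beam bound (<15.0.0): True
```

With the current bound no PyArrow version satisfies both packages, which is why waiting for the next apache-beam release is the only way to install them together.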

@mariosasko mariosasko changed the title from "Fix array.values handling in array cast/embed" to "Fix array cast/embed with null values" Jan 31, 2024
@mariosasko mariosasko marked this pull request as ready for review February 1, 2024 00:00
Member

@lhoestq lhoestq left a comment

Nice ! Handling extension types can be quite complicated

Btw, if there are PyArrow issues we can link to this PR, feel free to add them. That way we can follow their progress and maybe simplify this code later.

if array_type != storage_type:
    # Temporarily convert to the storage type to support extension types in the slice operation
    array = _c(array, storage_type)
    array = pc.list_slice(array, 0, pa_type.list_size, return_fixed_size_list=True)
Member

This may bring the data into memory, no?

Maybe it's fine for now, though.

Collaborator Author

@mariosasko mariosasko Feb 6, 2024

Yes, pc.list_slice loads the array into memory. Unfortunately, I don't think this can be avoided: we need to "expand" the null values in the .values array to prepare them for the ListArray -> FixedSizeListArray cast, and that requires a memory allocation.

We (usually) run these casts on subtables before writing them to disk, so this solution should be fine for now.

Collaborator Author

Casting ListArray -> FixedSizeListArray using PyArrow's ListArray.cast also allocates memory if the array contains null lists, so there indeed isn't much we can do about this, because the two types store null values differently.
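
The "expansion" discussed in this thread can be sketched in pure Python. This models the two storage layouts with plain lists and is not the code from this PR or PyArrow's implementation; `expand_nulls` is a hypothetical name, and the sketch assumes null rows occupy no child values in the variable-size layout.

```python
from typing import List, Optional, Tuple


def expand_nulls(
    offsets: List[int],
    validity: List[bool],
    values: List[int],
    list_size: int,
) -> Tuple[List[bool], List[Optional[int]]]:
    """Build fixed-size-list storage: every row, null or not, owns list_size slots."""
    out_values: List[Optional[int]] = []  # new allocation, unlike a zero-copy view
    for i, is_valid in enumerate(validity):
        if is_valid:
            row = values[offsets[i]:offsets[i + 1]]
            if len(row) != list_size:
                raise ValueError(f"row {i} has length {len(row)}, expected {list_size}")
            out_values.extend(row)
        else:
            # A null row has no child values in the list layout, so pad with placeholders.
            out_values.extend([None] * list_size)
    return list(validity), out_values


# Layout of [[1, 2], None, [3, 4]] as a variable-size list array.
validity, fixed_values = expand_nulls(
    [0, 2, 2, 4], [True, False, True], [1, 2, 3, 4], list_size=2
)
print(fixed_values)  # [1, 2, None, None, 3, 4]
```

The output buffer is longer than the input whenever a null row is present, which is why the conversion cannot be done as a view over the original values.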

Co-authored-by: Quentin Lhoest
@mariosasko mariosasko merged commit ac05bac into main Feb 6, 2024
@mariosasko mariosasko deleted the fix-array_values branch February 6, 2024 19:24
@github-actions

github-actions bot commented Feb 6, 2024

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

| metric | new | old | diff |
| --- | --- | --- | --- |
| read_batch_formatted_as_numpy after write_array2d | 0.005188 | 0.011353 | -0.006165 |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003997 | 0.011008 | -0.007011 |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.062642 | 0.038508 | 0.024134 |
| read_batch_unformated after write_array2d | 0.028913 | 0.023109 | 0.005804 |
| read_batch_unformated after write_flattened_sequence | 0.248289 | 0.275898 | -0.027609 |
| read_batch_unformated after write_nested_sequence | 0.268084 | 0.323480 | -0.055396 |
| read_col_formatted_as_numpy after write_array2d | 0.004093 | 0.007986 | -0.003893 |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.002822 | 0.004328 | -0.001506 |
| read_col_formatted_as_numpy after write_nested_sequence | 0.048263 | 0.004250 | 0.044012 |
| read_col_unformated after write_array2d | 0.041520 | 0.037052 | 0.004468 |
| read_col_unformated after write_flattened_sequence | 0.263277 | 0.258489 | 0.004788 |
| read_col_unformated after write_nested_sequence | 0.289835 | 0.293841 | -0.004006 |
| read_formatted_as_numpy after write_array2d | 0.027621 | 0.128546 | -0.100925 |
| read_formatted_as_numpy after write_flattened_sequence | 0.010793 | 0.075646 | -0.064853 |
| read_formatted_as_numpy after write_nested_sequence | 0.207624 | 0.419271 | -0.211648 |
| read_unformated after write_array2d | 0.035597 | 0.043533 | -0.007936 |
| read_unformated after write_flattened_sequence | 0.245706 | 0.255139 | -0.009433 |
| read_unformated after write_nested_sequence | 0.268157 | 0.283200 | -0.015043 |
| write_array2d | 0.017310 | 0.141683 | -0.124373 |
| write_flattened_sequence | 1.130656 | 1.452155 | -0.321499 |
| write_nested_sequence | 1.162134 | 1.492716 | -0.330583 |

Benchmark: benchmark_getitem_100B.json

| metric | new | old | diff |
| --- | --- | --- | --- |
| get_batch_of_1024_random_rows | 0.094081 | 0.018006 | 0.076075 |
| get_batch_of_1024_rows | 0.302298 | 0.000490 | 0.301809 |
| get_first_row | 0.000220 | 0.000200 | 0.000020 |
| get_last_row | 0.000048 | 0.000054 | -0.000006 |

Benchmark: benchmark_indices_mapping.json

| metric | new | old | diff |
| --- | --- | --- | --- |
| select | 0.019072 | 0.037411 | -0.018339 |
| shard | 0.061162 | 0.014526 | 0.046636 |
| shuffle | 0.072820 | 0.176557 | -0.103737 |
| sort | 0.122628 | 0.737135 | -0.614507 |
| train_test_split | 0.074962 | 0.296338 | -0.221377 |

Benchmark: benchmark_iterating.json

| metric | new | old | diff |
| --- | --- | --- | --- |
| read 5000 | 0.277858 | 0.215209 | 0.062649 |
| read 50000 | 2.688478 | 2.077655 | 0.610823 |
| read_batch 50000 10 | 1.397366 | 1.504120 | -0.106754 |
| read_batch 50000 100 | 1.285078 | 1.541195 | -0.256117 |
| read_batch 50000 1000 | 1.291559 | 1.468490 | -0.176931 |
| read_formatted numpy 5000 | 0.553646 | 4.584777 | -4.031131 |
| read_formatted pandas 5000 | 2.355737 | 3.745712 | -1.389975 |
| read_formatted tensorflow 5000 | 2.773025 | 5.269862 | -2.496836 |
| read_formatted torch 5000 | 1.731195 | 4.565676 | -2.834481 |
| read_formatted_batch numpy 5000 10 | 0.061372 | 0.424275 | -0.362903 |
| read_formatted_batch numpy 5000 1000 | 0.004928 | 0.007607 | -0.002679 |
| shuffled read 5000 | 0.321703 | 0.226044 | 0.095659 |
| shuffled read 50000 | 3.212927 | 2.268929 | 0.943999 |
| shuffled read_batch 50000 10 | 1.727104 | 55.444624 | -53.717521 |
| shuffled read_batch 50000 100 | 1.479430 | 6.876477 | -5.397047 |
| shuffled read_batch 50000 1000 | 1.513436 | 2.142072 | -0.628637 |
| shuffled read_formatted numpy 5000 | 0.629913 | 4.805227 | -4.175315 |
| shuffled read_formatted_batch numpy 5000 10 | 0.114607 | 6.500664 | -6.386057 |
| shuffled read_formatted_batch numpy 5000 1000 | 0.041707 | 0.075469 | -0.033762 |

Benchmark: benchmark_map_filter.json

| metric | new | old | diff |
| --- | --- | --- | --- |
| filter | 0.976060 | 1.841788 | -0.865727 |
| map fast-tokenizer batched | 11.575163 | 8.074308 | 3.500855 |
| map identity | 9.521390 | 10.191392 | -0.670003 |
| map identity batched | 0.138725 | 0.680424 | -0.541699 |
| map no-op batched | 0.013752 | 0.534201 | -0.520449 |
| map no-op batched numpy | 0.286252 | 0.579283 | -0.293031 |
| map no-op batched pandas | 0.263420 | 0.434364 | -0.170944 |
| map no-op batched pytorch | 0.325531 | 0.540337 | -0.214806 |
| map no-op batched tensorflow | 0.419466 | 1.386936 | -0.967470 |
PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new | old | diff |
| --- | --- | --- | --- |
| read_batch_formatted_as_numpy after write_array2d | 0.005615 | 0.011353 | -0.005738 |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003884 | 0.011008 | -0.007124 |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.049563 | 0.038508 | 0.011055 |
| read_batch_unformated after write_array2d | 0.032573 | 0.023109 | 0.009464 |
| read_batch_unformated after write_flattened_sequence | 0.276917 | 0.275898 | 0.001019 |
| read_batch_unformated after write_nested_sequence | 0.298403 | 0.323480 | -0.025077 |
| read_col_formatted_as_numpy after write_array2d | 0.004367 | 0.007986 | -0.003618 |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.002794 | 0.004328 | -0.001534 |
| read_col_formatted_as_numpy after write_nested_sequence | 0.049105 | 0.004250 | 0.044855 |
| read_col_unformated after write_array2d | 0.045597 | 0.037052 | 0.008545 |
| read_col_unformated after write_flattened_sequence | 0.289762 | 0.258489 | 0.031273 |
| read_col_unformated after write_nested_sequence | 0.318440 | 0.293841 | 0.024599 |
| read_formatted_as_numpy after write_array2d | 0.051883 | 0.128546 | -0.076664 |
| read_formatted_as_numpy after write_flattened_sequence | 0.010644 | 0.075646 | -0.065003 |
| read_formatted_as_numpy after write_nested_sequence | 0.057455 | 0.419271 | -0.361816 |
| read_unformated after write_array2d | 0.033667 | 0.043533 | -0.009866 |
| read_unformated after write_flattened_sequence | 0.274424 | 0.255139 | 0.019285 |
| read_unformated after write_nested_sequence | 0.295890 | 0.283200 | 0.012690 |
| write_array2d | 0.017029 | 0.141683 | -0.124654 |
| write_flattened_sequence | 1.130123 | 1.452155 | -0.322031 |
| write_nested_sequence | 1.214827 | 1.492716 | -0.277889 |

Benchmark: benchmark_getitem_100B.json

| metric | new | old | diff |
| --- | --- | --- | --- |
| get_batch_of_1024_random_rows | 0.094882 | 0.018006 | 0.076876 |
| get_batch_of_1024_rows | 0.302505 | 0.000490 | 0.302015 |
| get_first_row | 0.000228 | 0.000200 | 0.000028 |
| get_last_row | 0.000052 | 0.000054 | -0.000003 |

Benchmark: benchmark_indices_mapping.json

| metric | new | old | diff |
| --- | --- | --- | --- |
| select | 0.021695 | 0.037411 | -0.015716 |
| shard | 0.075196 | 0.014526 | 0.060670 |
| shuffle | 0.086641 | 0.176557 | -0.089915 |
| sort | 0.124893 | 0.737135 | -0.612243 |
| train_test_split | 0.088765 | 0.296338 | -0.207574 |

Benchmark: benchmark_iterating.json

| metric | new | old | diff |
| --- | --- | --- | --- |
| read 5000 | 0.303388 | 0.215209 | 0.088179 |
| read 50000 | 2.934506 | 2.077655 | 0.856852 |
| read_batch 50000 10 | 1.608607 | 1.504120 | 0.104487 |
| read_batch 50000 100 | 1.494632 | 1.541195 | -0.046563 |
| read_batch 50000 1000 | 1.512801 | 1.468490 | 0.044310 |
| read_formatted numpy 5000 | 0.558563 | 4.584777 | -4.026214 |
| read_formatted pandas 5000 | 2.383212 | 3.745712 | -1.362500 |
| read_formatted tensorflow 5000 | 2.634629 | 5.269862 | -2.635233 |
| read_formatted torch 5000 | 1.729319 | 4.565676 | -2.836357 |
| read_formatted_batch numpy 5000 10 | 0.062345 | 0.424275 | -0.361930 |
| read_formatted_batch numpy 5000 1000 | 0.004981 | 0.007607 | -0.002626 |
| shuffled read 5000 | 0.358333 | 0.226044 | 0.132289 |
| shuffled read 50000 | 3.484229 | 2.268929 | 1.215301 |
| shuffled read_batch 50000 10 | 2.010043 | 55.444624 | -53.434581 |
| shuffled read_batch 50000 100 | 1.693733 | 6.876477 | -5.182744 |
| shuffled read_batch 50000 1000 | 1.824150 | 2.142072 | -0.317922 |
| shuffled read_formatted numpy 5000 | 0.650835 | 4.805227 | -4.154392 |
| shuffled read_formatted_batch numpy 5000 10 | 0.115933 | 6.500664 | -6.384732 |
| shuffled read_formatted_batch numpy 5000 1000 | 0.041270 | 0.075469 | -0.034199 |

Benchmark: benchmark_map_filter.json

| metric | new | old | diff |
| --- | --- | --- | --- |
| filter | 1.007949 | 1.841788 | -0.833838 |
| map fast-tokenizer batched | 12.000085 | 8.074308 | 3.925776 |
| map identity | 10.453119 | 10.191392 | 0.261727 |
| map identity batched | 0.143583 | 0.680424 | -0.536840 |
| map no-op batched | 0.015937 | 0.534201 | -0.518264 |
| map no-op batched numpy | 0.286653 | 0.579283 | -0.292631 |
| map no-op batched pandas | 0.272359 | 0.434364 | -0.162005 |
| map no-op batched pytorch | 0.330520 | 0.540337 | -0.209818 |
| map no-op batched tensorflow | 0.417015 | 1.386936 | -0.969921 |

@20141888

20141888 commented Jul 4, 2024

The problem still occurs.
Hugging Face sucks 🤮🤮🤮🤮

5 participants