docs: resolving namespace conflict, refactored variable #6312

smty2018 · 2023-10-18T16:10:59Z

This change updates the example code in the about_arrow.md docs.

The example used 'time' as a variable name, which can conflict with Python's standard-library 'time' module. That is poor convention: anyone who reuses the example in code that also imports 'time' would unintentionally shadow the module.
To keep the code clear and avoid the conflict, I renamed the variable 'time' to 'elapsed_time' in the example code.
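
For reviewers, here is a minimal sketch of the shadowing problem. It is not the docs example itself: the timed statement `STMT` and the two helper functions are hypothetical stand-ins, assuming only that the example stores a `timeit.timeit(...)` result in a variable.

```python
import time
import timeit

# Hypothetical workload; the docs example times dataset iteration instead.
STMT = "sum(range(1_000))"


def shadowed_version():
    """Rebind the name `time` to a float, as the old variable name would."""
    import time  # local import, so the rebinding stays inside this function
    time = timeit.timeit(stmt=STMT, number=1)
    try:
        time.sleep(0)  # `time` is now a float, not the module
    except AttributeError as exc:
        return type(time).__name__, str(exc)
    return type(time).__name__, None


def renamed_version():
    """Use `elapsed_time`, so the `time` module stays reachable."""
    elapsed_time = timeit.timeit(stmt=STMT, number=1)
    time.sleep(0)  # still refers to the module
    return elapsed_time


kind, error = shadowed_version()
elapsed_time = renamed_version()
print(kind, error)
print(f"Time: {elapsed_time:.6f} s")
```

The failure only shows up if the reader's code also imports 'time', which is why the rename is a precaution rather than a bug fix.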

stevhliu

Thanks for improving the clarity! Pinging @lhoestq for a final look :)

github-actions · 2023-10-19T16:31:58Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006209 / 0.011353 (-0.005144)	0.003708 / 0.011008 (-0.007300)	0.080435 / 0.038508 (0.041926)	0.060105 / 0.023109 (0.036995)	0.392962 / 0.275898 (0.117064)	0.429381 / 0.323480 (0.105902)	0.003596 / 0.007986 (-0.004390)	0.003849 / 0.004328 (-0.000480)	0.062377 / 0.004250 (0.058127)	0.048718 / 0.037052 (0.011666)	0.400906 / 0.258489 (0.142417)	0.440335 / 0.293841 (0.146494)	0.027807 / 0.128546 (-0.100739)	0.008066 / 0.075646 (-0.067580)	0.262542 / 0.419271 (-0.156730)	0.045513 / 0.043533 (0.001980)	0.399608 / 0.255139 (0.144469)	0.418007 / 0.283200 (0.134807)	0.023475 / 0.141683 (-0.118208)	1.476563 / 1.452155 (0.024409)	1.528898 / 1.492716 (0.036182)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.223798 / 0.018006 (0.205792)	0.430526 / 0.000490 (0.430036)	0.009232 / 0.000200 (0.009032)	0.000082 / 0.000054 (0.000028)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.024921 / 0.037411 (-0.012490)	0.077692 / 0.014526 (0.063166)	0.085382 / 0.176557 (-0.091174)	0.146220 / 0.737135 (-0.590915)	0.086396 / 0.296338 (-0.209943)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.439986 / 0.215209 (0.224777)	4.384552 / 2.077655 (2.306897)	2.373697 / 1.504120 (0.869577)	2.176138 / 1.541195 (0.634943)	2.225914 / 1.468490 (0.757424)	0.505776 / 4.584777 (-4.079001)	3.053744 / 3.745712 (-0.691968)	3.080443 / 5.269862 (-2.189419)	1.904392 / 4.565676 (-2.661285)	0.058112 / 0.424275 (-0.366163)	0.006631 / 0.007607 (-0.000976)	0.503409 / 0.226044 (0.277365)	5.053375 / 2.268929 (2.784447)	2.789963 / 55.444624 (-52.654661)	2.452659 / 6.876477 (-4.423818)	2.512353 / 2.142072 (0.370280)	0.590095 / 4.805227 (-4.215132)	0.126267 / 6.500664 (-6.374397)	0.061246 / 0.075469 (-0.014223)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.249884 / 1.841788 (-0.591903)	17.684730 / 8.074308 (9.610422)	13.967467 / 10.191392 (3.776075)	0.144202 / 0.680424 (-0.536222)	0.017004 / 0.534201 (-0.517197)	0.333634 / 0.579283 (-0.245649)	0.387251 / 0.434364 (-0.047113)	0.390189 / 0.540337 (-0.150148)	0.535662 / 1.386936 (-0.851274)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006379 / 0.011353 (-0.004974)	0.003681 / 0.011008 (-0.007327)	0.063005 / 0.038508 (0.024497)	0.064221 / 0.023109 (0.041112)	0.446074 / 0.275898 (0.170176)	0.471997 / 0.323480 (0.148517)	0.005074 / 0.007986 (-0.002911)	0.002945 / 0.004328 (-0.001383)	0.063305 / 0.004250 (0.059054)	0.050608 / 0.037052 (0.013556)	0.443260 / 0.258489 (0.184771)	0.478497 / 0.293841 (0.184656)	0.028980 / 0.128546 (-0.099566)	0.008145 / 0.075646 (-0.067502)	0.068412 / 0.419271 (-0.350859)	0.041552 / 0.043533 (-0.001980)	0.436649 / 0.255139 (0.181510)	0.462397 / 0.283200 (0.179198)	0.019929 / 0.141683 (-0.121753)	1.530248 / 1.452155 (0.078093)	1.611117 / 1.492716 (0.118401)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.232894 / 0.018006 (0.214888)	0.421451 / 0.000490 (0.420961)	0.003984 / 0.000200 (0.003784)	0.000084 / 0.000054 (0.000030)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.027776 / 0.037411 (-0.009635)	0.081632 / 0.014526 (0.067106)	0.094031 / 0.176557 (-0.082526)	0.147930 / 0.737135 (-0.589206)	0.094226 / 0.296338 (-0.202112)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.471722 / 0.215209 (0.256513)	4.713241 / 2.077655 (2.635587)	2.662660 / 1.504120 (1.158540)	2.490778 / 1.541195 (0.949583)	2.555786 / 1.468490 (1.087296)	0.512209 / 4.584777 (-4.072568)	3.210612 / 3.745712 (-0.535100)	2.863346 / 5.269862 (-2.406516)	1.884664 / 4.565676 (-2.681012)	0.058514 / 0.424275 (-0.365761)	0.006473 / 0.007607 (-0.001134)	0.543279 / 0.226044 (0.317235)	5.441485 / 2.268929 (3.172556)	3.145398 / 55.444624 (-52.299226)	2.749603 / 6.876477 (-4.126874)	2.925738 / 2.142072 (0.783666)	0.598725 / 4.805227 (-4.206502)	0.125616 / 6.500664 (-6.375048)	0.061314 / 0.075469 (-0.014155)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.384270 / 1.841788 (-0.457518)	18.307618 / 8.074308 (10.233310)	14.635768 / 10.191392 (4.444376)	0.148787 / 0.680424 (-0.531637)	0.018191 / 0.534201 (-0.516010)	0.333166 / 0.579283 (-0.246117)	0.405116 / 0.434364 (-0.029247)	0.392798 / 0.540337 (-0.147540)	0.582299 / 1.386936 (-0.804637)

smty2018 added 2 commits October 18, 2023 13:44

Update about_arrow.md

eb1e0dd

Merge branch 'huggingface:main' into patch1

991b8bc

stevhliu approved these changes Oct 19, 2023

lhoestq approved these changes Oct 19, 2023

lhoestq merged commit 7004f0f into huggingface:main Oct 19, 2023
