Minor tqdm fixes #5754

mariosasko · 2023-04-14T18:15:14Z

GeneratorBasedBuilder's TQDM bars were not used as context managers. This PR fixes that (I missed these bars in #5560).

This PR also modifies the single-process save_to_disk so that the TQDM bar accumulates progress across shards in the multi-shard setting (again, I introduced this bug in the linked PR 😎)
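
For readers unfamiliar with the pattern, here is a minimal sketch of a tqdm bar used as a context manager. It is not the code from this PR: `write_examples` and its parameters are hypothetical names chosen for illustration, and the loop body stands in for whatever the builder does with each example.

```python
from tqdm import tqdm


def write_examples(examples, total, out=None):
    """Hypothetical helper: consume `examples` while reporting progress.

    `out` is the stream the bar is drawn to (tqdm defaults to stderr).
    """
    # Entering the `with` block opens the bar; leaving it calls close(),
    # including when the loop body raises, so the bar is never left dangling.
    with tqdm(total=total, unit=" examples", file=out) as pbar:
        for _ in examples:
            pbar.update(1)  # stand-in for writing one example
    return pbar.n  # number of steps the bar recorded
```

Without the `with` block, an exception raised while iterating would skip the explicit `close()` call and could leave a half-drawn bar in the terminal.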
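
A rough sketch of what "accumulating progress across shards" can look like: one bar is created outside the shard loop with the combined total, so it keeps counting instead of starting over for each shard. This is an illustration under assumed names (`save_shards`, shards modelled as plain lists), not the actual save_to_disk implementation.

```python
from tqdm import tqdm


def save_shards(shards, out=None):
    """Hypothetical helper: 'save' every shard under a single progress bar."""
    total = sum(len(shard) for shard in shards)
    # The bar lives outside the shard loop, so its counter carries over
    # from one shard to the next and ends at the combined total.
    with tqdm(total=total, unit=" examples", desc="Saving", file=out) as pbar:
        for shard in shards:
            for _ in shard:
                pbar.update(1)  # stand-in for writing one row of the shard
    return pbar.n
```

Creating a new bar inside the shard loop would instead show each shard's progress separately and never display the overall count.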

HuggingFaceDocBuilderDev · 2023-04-14T18:19:13Z

The documentation is not available anymore as the PR was closed or merged.

lhoestq

Nice fix :)

github-actions · 2023-04-20T15:27:58Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006479 / 0.011353 (-0.004874)	0.004592 / 0.011008 (-0.006416)	0.097239 / 0.038508 (0.058731)	0.028609 / 0.023109 (0.005499)	0.309225 / 0.275898 (0.033327)	0.340015 / 0.323480 (0.016535)	0.004857 / 0.007986 (-0.003129)	0.004649 / 0.004328 (0.000320)	0.074770 / 0.004250 (0.070520)	0.038351 / 0.037052 (0.001299)	0.313360 / 0.258489 (0.054871)	0.350256 / 0.293841 (0.056416)	0.030770 / 0.128546 (-0.097776)	0.011591 / 0.075646 (-0.064055)	0.322444 / 0.419271 (-0.096828)	0.043704 / 0.043533 (0.000171)	0.311790 / 0.255139 (0.056651)	0.339183 / 0.283200 (0.055984)	0.088041 / 0.141683 (-0.053642)	1.490649 / 1.452155 (0.038494)	1.561789 / 1.492716 (0.069072)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.208984 / 0.018006 (0.190978)	0.406105 / 0.000490 (0.405616)	0.003152 / 0.000200 (0.002952)	0.000074 / 0.000054 (0.000019)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.022622 / 0.037411 (-0.014790)	0.095819 / 0.014526 (0.081294)	0.105132 / 0.176557 (-0.071424)	0.165684 / 0.737135 (-0.571451)	0.106706 / 0.296338 (-0.189632)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.426126 / 0.215209 (0.210917)	4.233864 / 2.077655 (2.156209)	1.918727 / 1.504120 (0.414607)	1.729905 / 1.541195 (0.188710)	1.760342 / 1.468490 (0.291852)	0.695449 / 4.584777 (-3.889328)	3.413531 / 3.745712 (-0.332181)	1.904557 / 5.269862 (-3.365305)	1.270604 / 4.565676 (-3.295072)	0.083018 / 0.424275 (-0.341257)	0.012760 / 0.007607 (0.005152)	0.523991 / 0.226044 (0.297947)	5.236132 / 2.268929 (2.967204)	2.360959 / 55.444624 (-53.083665)	1.996533 / 6.876477 (-4.879943)	2.072934 / 2.142072 (-0.069138)	0.804133 / 4.805227 (-4.001094)	0.150976 / 6.500664 (-6.349688)	0.065503 / 0.075469 (-0.009966)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.211828 / 1.841788 (-0.629960)	13.657743 / 8.074308 (5.583435)	13.887148 / 10.191392 (3.695756)	0.145996 / 0.680424 (-0.534428)	0.016562 / 0.534201 (-0.517639)	0.380359 / 0.579283 (-0.198924)	0.388698 / 0.434364 (-0.045666)	0.440373 / 0.540337 (-0.099965)	0.531753 / 1.386936 (-0.855183)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006444 / 0.011353 (-0.004909)	0.004569 / 0.011008 (-0.006439)	0.076239 / 0.038508 (0.037731)	0.028462 / 0.023109 (0.005352)	0.365540 / 0.275898 (0.089642)	0.398242 / 0.323480 (0.074762)	0.005785 / 0.007986 (-0.002200)	0.003346 / 0.004328 (-0.000982)	0.076296 / 0.004250 (0.072046)	0.039853 / 0.037052 (0.002800)	0.367684 / 0.258489 (0.109195)	0.409570 / 0.293841 (0.115730)	0.030536 / 0.128546 (-0.098010)	0.011534 / 0.075646 (-0.064112)	0.084962 / 0.419271 (-0.334309)	0.042708 / 0.043533 (-0.000825)	0.344058 / 0.255139 (0.088919)	0.389096 / 0.283200 (0.105897)	0.090559 / 0.141683 (-0.051124)	1.507101 / 1.452155 (0.054946)	1.563977 / 1.492716 (0.071260)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.228740 / 0.018006 (0.210734)	0.396890 / 0.000490 (0.396400)	0.000392 / 0.000200 (0.000192)	0.000060 / 0.000054 (0.000006)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.025052 / 0.037411 (-0.012360)	0.099951 / 0.014526 (0.085426)	0.106847 / 0.176557 (-0.069710)	0.156666 / 0.737135 (-0.580469)	0.110344 / 0.296338 (-0.185994)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.442363 / 0.215209 (0.227154)	4.429571 / 2.077655 (2.351917)	2.076501 / 1.504120 (0.572381)	1.875226 / 1.541195 (0.334031)	1.909093 / 1.468490 (0.440603)	0.703047 / 4.584777 (-3.881730)	3.457036 / 3.745712 (-0.288676)	2.866648 / 5.269862 (-2.403214)	1.524430 / 4.565676 (-3.041246)	0.083687 / 0.424275 (-0.340588)	0.012251 / 0.007607 (0.004643)	0.543945 / 0.226044 (0.317901)	5.440559 / 2.268929 (3.171630)	2.522924 / 55.444624 (-52.921700)	2.188770 / 6.876477 (-4.687707)	2.249632 / 2.142072 (0.107559)	0.813499 / 4.805227 (-3.991728)	0.152861 / 6.500664 (-6.347803)	0.067189 / 0.075469 (-0.008280)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.284255 / 1.841788 (-0.557533)	14.207864 / 8.074308 (6.133556)	14.279691 / 10.191392 (4.088299)	0.167027 / 0.680424 (-0.513396)	0.016455 / 0.534201 (-0.517746)	0.380798 / 0.579283 (-0.198485)	0.390013 / 0.434364 (-0.044351)	0.445493 / 0.540337 (-0.094845)	0.526278 / 1.386936 (-0.860658)

Minor tqdm fixes

dd4ae2e

lhoestq approved these changes Apr 20, 2023

mariosasko merged commit 3fdb46c into main Apr 20, 2023

mariosasko deleted the tqdm-fixes branch April 20, 2023 15:21
