Use default audio resampling type #5556

lhoestq · 2023-02-21T10:45:50Z

...instead of relying on the optional librosa dependency resampy.

It was only used for _decode_non_mp3_file_like anyway, not for the other decoding methods, so removing it makes resampling consistent across decoding methods (except torchaudio decoding).

Therefore I think it is a better solution than adding resampy as a dependency in #5554.

cc @polinaeterna
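
For illustration only, here is a minimal sketch of the idea described above: resample only when the rates differ, and forward no explicit resampling type unless the caller asks for one, so the resampler's own default applies. The helper name `resample_if_needed` and its parameters are hypothetical and are not taken from the datasets code base. The resampler is passed in as `resample_fn` so the sketch runs without audio libraries; in practice it could be something like `librosa.resample`, which accepts `orig_sr`, `target_sr` and an optional `res_type` keyword.

```python
from typing import Any, Callable, Optional


def resample_if_needed(
    samples: Any,
    orig_sr: int,
    target_sr: Optional[int],
    resample_fn: Callable[..., Any],
    res_type: Optional[str] = None,
) -> Any:
    """Hypothetical helper: resample `samples` only when the rates differ.

    When `res_type` is None, no `res_type` keyword is forwarded, so
    `resample_fn` falls back to its own default resampling type.
    """
    # Nothing to do if no target rate was requested or it already matches.
    if target_sr is None or target_sr == orig_sr:
        return samples

    kwargs = {"orig_sr": orig_sr, "target_sr": target_sr}
    # Only pin a resampling type when the caller explicitly asks for one.
    if res_type is not None:
        kwargs["res_type"] = res_type
    return resample_fn(samples, **kwargs)


if __name__ == "__main__":
    # Stand-in resampler that just reports the keywords it received.
    def fake_resample(samples, **kwargs):
        print("resample called with", kwargs)
        return samples

    resample_if_needed([0.0, 0.5, 1.0], 44100, 16000, fake_resample)
    resample_if_needed([0.0, 0.5, 1.0], 44100, 16000, fake_resample, res_type="kaiser_best")
```

Leaving `res_type` unset keeps the helper independent of any optional backend: a value such as `"kaiser_best"` is only forwarded when a caller opts in explicitly.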

HuggingFaceDocBuilderDev · 2023-02-21T10:51:10Z

The documentation is not available anymore as the PR was closed or merged.

github-actions · 2023-02-21T10:51:21Z

PyArrow==6.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008730 / 0.011353 (-0.002623)	0.004551 / 0.011008 (-0.006457)	0.100206 / 0.038508 (0.061698)	0.030264 / 0.023109 (0.007154)	0.303310 / 0.275898 (0.027412)	0.339040 / 0.323480 (0.015560)	0.006923 / 0.007986 (-0.001063)	0.004707 / 0.004328 (0.000379)	0.077822 / 0.004250 (0.073571)	0.034368 / 0.037052 (-0.002684)	0.303125 / 0.258489 (0.044636)	0.348322 / 0.293841 (0.054481)	0.033831 / 0.128546 (-0.094715)	0.011459 / 0.075646 (-0.064187)	0.322092 / 0.419271 (-0.097180)	0.047720 / 0.043533 (0.004187)	0.304849 / 0.255139 (0.049710)	0.330767 / 0.283200 (0.047567)	0.087362 / 0.141683 (-0.054321)	1.536095 / 1.452155 (0.083941)	1.599979 / 1.492716 (0.107263)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.188985 / 0.018006 (0.170979)	0.410775 / 0.000490 (0.410286)	0.004215 / 0.000200 (0.004015)	0.000086 / 0.000054 (0.000032)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.023124 / 0.037411 (-0.014287)	0.096962 / 0.014526 (0.082436)	0.104070 / 0.176557 (-0.072486)	0.141248 / 0.737135 (-0.595887)	0.108534 / 0.296338 (-0.187804)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.417118 / 0.215209 (0.201909)	4.167808 / 2.077655 (2.090154)	2.016540 / 1.504120 (0.512420)	1.847812 / 1.541195 (0.306617)	1.967023 / 1.468490 (0.498532)	0.689262 / 4.584777 (-3.895515)	3.378747 / 3.745712 (-0.366965)	1.854126 / 5.269862 (-3.415735)	1.152102 / 4.565676 (-3.413575)	0.081839 / 0.424275 (-0.342437)	0.012426 / 0.007607 (0.004819)	0.521334 / 0.226044 (0.295289)	5.230593 / 2.268929 (2.961664)	2.269386 / 55.444624 (-53.175238)	1.965631 / 6.876477 (-4.910846)	2.028994 / 2.142072 (-0.113079)	0.802142 / 4.805227 (-4.003085)	0.147954 / 6.500664 (-6.352710)	0.065031 / 0.075469 (-0.010438)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.235289 / 1.841788 (-0.606499)	13.723507 / 8.074308 (5.649199)	14.197923 / 10.191392 (4.006531)	0.147950 / 0.680424 (-0.532473)	0.028332 / 0.534201 (-0.505869)	0.400180 / 0.579283 (-0.179103)	0.418970 / 0.434364 (-0.015393)	0.478381 / 0.540337 (-0.061957)	0.576138 / 1.386936 (-0.810798)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006548 / 0.011353 (-0.004805)	0.004567 / 0.011008 (-0.006441)	0.075658 / 0.038508 (0.037150)	0.027190 / 0.023109 (0.004080)	0.363417 / 0.275898 (0.087518)	0.399575 / 0.323480 (0.076095)	0.004982 / 0.007986 (-0.003004)	0.003364 / 0.004328 (-0.000964)	0.074392 / 0.004250 (0.070142)	0.038839 / 0.037052 (0.001787)	0.361133 / 0.258489 (0.102644)	0.408557 / 0.293841 (0.114717)	0.031468 / 0.128546 (-0.097078)	0.011645 / 0.075646 (-0.064001)	0.085145 / 0.419271 (-0.334126)	0.041775 / 0.043533 (-0.001758)	0.348624 / 0.255139 (0.093485)	0.389610 / 0.283200 (0.106410)	0.088576 / 0.141683 (-0.053107)	1.511208 / 1.452155 (0.059054)	1.560568 / 1.492716 (0.067852)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.185017 / 0.018006 (0.167011)	0.407543 / 0.000490 (0.407053)	0.002486 / 0.000200 (0.002286)	0.000076 / 0.000054 (0.000021)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.025181 / 0.037411 (-0.012231)	0.099056 / 0.014526 (0.084530)	0.108597 / 0.176557 (-0.067959)	0.144664 / 0.737135 (-0.592471)	0.110417 / 0.296338 (-0.185922)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.434302 / 0.215209 (0.219093)	4.327840 / 2.077655 (2.250185)	2.059939 / 1.504120 (0.555819)	1.853267 / 1.541195 (0.312072)	1.906616 / 1.468490 (0.438126)	0.700165 / 4.584777 (-3.884611)	3.439216 / 3.745712 (-0.306496)	2.792034 / 5.269862 (-2.477827)	1.424852 / 4.565676 (-3.140824)	0.083926 / 0.424275 (-0.340349)	0.013943 / 0.007607 (0.006336)	0.535964 / 0.226044 (0.309920)	5.368671 / 2.268929 (3.099743)	2.497027 / 55.444624 (-52.947597)	2.166222 / 6.876477 (-4.710254)	2.183766 / 2.142072 (0.041693)	0.805886 / 4.805227 (-3.999341)	0.152474 / 6.500664 (-6.348190)	0.067354 / 0.075469 (-0.008115)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.284052 / 1.841788 (-0.557736)	13.714066 / 8.074308 (5.639758)	14.195212 / 10.191392 (4.003820)	0.151815 / 0.680424 (-0.528609)	0.016847 / 0.534201 (-0.517354)	0.391174 / 0.579283 (-0.188109)	0.409784 / 0.434364 (-0.024580)	0.473880 / 0.540337 (-0.066458)	0.561016 / 1.386936 (-0.825920)

github-actions · 2023-02-21T10:54:21Z

PyArrow==6.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.010284 / 0.011353 (-0.001068)	0.005654 / 0.011008 (-0.005355)	0.100522 / 0.038508 (0.062014)	0.039201 / 0.023109 (0.016092)	0.320831 / 0.275898 (0.044933)	0.365351 / 0.323480 (0.041871)	0.009066 / 0.007986 (0.001080)	0.005805 / 0.004328 (0.001476)	0.076969 / 0.004250 (0.072719)	0.045813 / 0.037052 (0.008760)	0.327115 / 0.258489 (0.068626)	0.362823 / 0.293841 (0.068982)	0.040521 / 0.128546 (-0.088025)	0.013166 / 0.075646 (-0.062481)	0.358579 / 0.419271 (-0.060692)	0.051753 / 0.043533 (0.008220)	0.323741 / 0.255139 (0.068602)	0.360211 / 0.283200 (0.077011)	0.111534 / 0.141683 (-0.030149)	1.594887 / 1.452155 (0.142732)	1.651516 / 1.492716 (0.158799)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.012051 / 0.018006 (-0.005956)	0.475316 / 0.000490 (0.474826)	0.004804 / 0.000200 (0.004604)	0.000100 / 0.000054 (0.000046)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.027480 / 0.037411 (-0.009931)	0.112022 / 0.014526 (0.097496)	0.121539 / 0.176557 (-0.055017)	0.166327 / 0.737135 (-0.570809)	0.132575 / 0.296338 (-0.163763)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.418322 / 0.215209 (0.203113)	4.149463 / 2.077655 (2.071808)	1.890901 / 1.504120 (0.386781)	1.682521 / 1.541195 (0.141327)	1.716331 / 1.468490 (0.247841)	0.729176 / 4.584777 (-3.855601)	4.173303 / 3.745712 (0.427591)	2.166249 / 5.269862 (-3.103612)	1.384623 / 4.565676 (-3.181053)	0.095486 / 0.424275 (-0.328789)	0.013800 / 0.007607 (0.006193)	0.573917 / 0.226044 (0.347872)	5.348843 / 2.268929 (3.079914)	2.421716 / 55.444624 (-53.022909)	2.002048 / 6.876477 (-4.874428)	2.079493 / 2.142072 (-0.062579)	0.882818 / 4.805227 (-3.922409)	0.172936 / 6.500664 (-6.327728)	0.068384 / 0.075469 (-0.007085)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.285704 / 1.841788 (-0.556084)	16.036346 / 8.074308 (7.962038)	15.181557 / 10.191392 (4.990165)	0.194044 / 0.680424 (-0.486380)	0.033128 / 0.534201 (-0.501073)	0.480290 / 0.579283 (-0.098993)	0.497525 / 0.434364 (0.063161)	0.602304 / 0.540337 (0.061966)	0.754273 / 1.386936 (-0.632663)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007263 / 0.011353 (-0.004090)	0.005164 / 0.011008 (-0.005845)	0.079833 / 0.038508 (0.041325)	0.033974 / 0.023109 (0.010865)	0.382057 / 0.275898 (0.106159)	0.402924 / 0.323480 (0.079444)	0.007273 / 0.007986 (-0.000712)	0.004378 / 0.004328 (0.000049)	0.080556 / 0.004250 (0.076305)	0.047376 / 0.037052 (0.010324)	0.379044 / 0.258489 (0.120555)	0.422135 / 0.293841 (0.128294)	0.038294 / 0.128546 (-0.090252)	0.013974 / 0.075646 (-0.061672)	0.094936 / 0.419271 (-0.324335)	0.051033 / 0.043533 (0.007501)	0.368197 / 0.255139 (0.113058)	0.409627 / 0.283200 (0.126427)	0.107365 / 0.141683 (-0.034318)	1.537501 / 1.452155 (0.085346)	1.618021 / 1.492716 (0.125305)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.227187 / 0.018006 (0.209181)	0.473226 / 0.000490 (0.472736)	0.006532 / 0.000200 (0.006332)	0.000121 / 0.000054 (0.000066)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.029814 / 0.037411 (-0.007597)	0.121113 / 0.014526 (0.106587)	0.125107 / 0.176557 (-0.051450)	0.167008 / 0.737135 (-0.570127)	0.128720 / 0.296338 (-0.167619)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.452158 / 0.215209 (0.236949)	4.507087 / 2.077655 (2.429433)	2.193910 / 1.504120 (0.689790)	1.991234 / 1.541195 (0.450039)	2.120256 / 1.468490 (0.651766)	0.726664 / 4.584777 (-3.858113)	4.213148 / 3.745712 (0.467436)	4.082857 / 5.269862 (-1.187005)	1.741018 / 4.565676 (-2.824658)	0.090176 / 0.424275 (-0.334099)	0.013221 / 0.007607 (0.005614)	0.558868 / 0.226044 (0.332824)	5.617242 / 2.268929 (3.348313)	2.985430 / 55.444624 (-52.459194)	2.623136 / 6.876477 (-4.253341)	2.383177 / 2.142072 (0.241105)	0.917237 / 4.805227 (-3.887990)	0.178774 / 6.500664 (-6.321890)	0.064707 / 0.075469 (-0.010762)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.365402 / 1.841788 (-0.476385)	16.035773 / 8.074308 (7.961465)	13.917612 / 10.191392 (3.726220)	0.152191 / 0.680424 (-0.528233)	0.020734 / 0.534201 (-0.513467)	0.442055 / 0.579283 (-0.137228)	0.470588 / 0.434364 (0.036224)	0.563433 / 0.540337 (0.023096)	0.651161 / 1.386936 (-0.735775)

lhoestq · 2023-02-21T11:22:27Z

If it's good for you @polinaeterna I'd like to merge it and then run the transformers CI to see if it changes anything

polinaeterna

@lhoestq lgtm thank you!

github-actions · 2023-02-21T12:49:49Z

PyArrow==6.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008829 / 0.011353 (-0.002524)	0.004652 / 0.011008 (-0.006356)	0.102505 / 0.038508 (0.063997)	0.030164 / 0.023109 (0.007054)	0.306551 / 0.275898 (0.030653)	0.368920 / 0.323480 (0.045440)	0.007084 / 0.007986 (-0.000902)	0.003545 / 0.004328 (-0.000783)	0.079402 / 0.004250 (0.075152)	0.035296 / 0.037052 (-0.001756)	0.312010 / 0.258489 (0.053520)	0.348773 / 0.293841 (0.054932)	0.034622 / 0.128546 (-0.093924)	0.011727 / 0.075646 (-0.063920)	0.326911 / 0.419271 (-0.092361)	0.043832 / 0.043533 (0.000300)	0.306357 / 0.255139 (0.051218)	0.328744 / 0.283200 (0.045544)	0.091954 / 0.141683 (-0.049729)	1.563989 / 1.452155 (0.111834)	1.591901 / 1.492716 (0.099185)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.194955 / 0.018006 (0.176948)	0.412864 / 0.000490 (0.412374)	0.003710 / 0.000200 (0.003510)	0.000081 / 0.000054 (0.000026)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.023132 / 0.037411 (-0.014279)	0.099586 / 0.014526 (0.085060)	0.105031 / 0.176557 (-0.071525)	0.141206 / 0.737135 (-0.595929)	0.111978 / 0.296338 (-0.184360)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.413729 / 0.215209 (0.198520)	4.161713 / 2.077655 (2.084058)	1.887442 / 1.504120 (0.383322)	1.711847 / 1.541195 (0.170653)	1.756833 / 1.468490 (0.288343)	0.699239 / 4.584777 (-3.885538)	3.346213 / 3.745712 (-0.399499)	2.822289 / 5.269862 (-2.447573)	1.475650 / 4.565676 (-3.090027)	0.082800 / 0.424275 (-0.341475)	0.012302 / 0.007607 (0.004695)	0.523068 / 0.226044 (0.297024)	5.242833 / 2.268929 (2.973904)	2.310768 / 55.444624 (-53.133856)	1.954629 / 6.876477 (-4.921847)	2.015563 / 2.142072 (-0.126510)	0.812613 / 4.805227 (-3.992614)	0.149512 / 6.500664 (-6.351152)	0.065162 / 0.075469 (-0.010307)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.270177 / 1.841788 (-0.571610)	13.664765 / 8.074308 (5.590457)	14.317968 / 10.191392 (4.126576)	0.138135 / 0.680424 (-0.542289)	0.028503 / 0.534201 (-0.505698)	0.402921 / 0.579283 (-0.176362)	0.400999 / 0.434364 (-0.033365)	0.470983 / 0.540337 (-0.069355)	0.544319 / 1.386936 (-0.842617)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006841 / 0.011353 (-0.004512)	0.004570 / 0.011008 (-0.006439)	0.076398 / 0.038508 (0.037890)	0.028136 / 0.023109 (0.005027)	0.339864 / 0.275898 (0.063966)	0.375496 / 0.323480 (0.052016)	0.004967 / 0.007986 (-0.003019)	0.003411 / 0.004328 (-0.000917)	0.075727 / 0.004250 (0.071476)	0.040025 / 0.037052 (0.002973)	0.340473 / 0.258489 (0.081984)	0.384396 / 0.293841 (0.090555)	0.031683 / 0.128546 (-0.096863)	0.011752 / 0.075646 (-0.063894)	0.085635 / 0.419271 (-0.333636)	0.042764 / 0.043533 (-0.000769)	0.339417 / 0.255139 (0.084278)	0.364190 / 0.283200 (0.080991)	0.093842 / 0.141683 (-0.047841)	1.480999 / 1.452155 (0.028844)	1.549752 / 1.492716 (0.057036)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.174146 / 0.018006 (0.156140)	0.415459 / 0.000490 (0.414970)	0.002854 / 0.000200 (0.002654)	0.000077 / 0.000054 (0.000023)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.024671 / 0.037411 (-0.012740)	0.101229 / 0.014526 (0.086703)	0.108841 / 0.176557 (-0.067716)	0.144530 / 0.737135 (-0.592606)	0.112509 / 0.296338 (-0.183829)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.460561 / 0.215209 (0.245352)	4.591139 / 2.077655 (2.513484)	2.275535 / 1.504120 (0.771415)	2.070976 / 1.541195 (0.529781)	2.028766 / 1.468490 (0.560276)	0.706166 / 4.584777 (-3.878611)	3.408498 / 3.745712 (-0.337215)	3.034665 / 5.269862 (-2.235197)	1.586805 / 4.565676 (-2.978872)	0.083355 / 0.424275 (-0.340920)	0.012460 / 0.007607 (0.004853)	0.565256 / 0.226044 (0.339212)	5.662643 / 2.268929 (3.393715)	2.697019 / 55.444624 (-52.747605)	2.302044 / 6.876477 (-4.574433)	2.373081 / 2.142072 (0.231009)	0.809804 / 4.805227 (-3.995423)	0.151481 / 6.500664 (-6.349183)	0.066870 / 0.075469 (-0.008599)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.257293 / 1.841788 (-0.584495)	14.059454 / 8.074308 (5.985146)	13.783251 / 10.191392 (3.591859)	0.140007 / 0.680424 (-0.540417)	0.016624 / 0.534201 (-0.517577)	0.381703 / 0.579283 (-0.197580)	0.389032 / 0.434364 (-0.045332)	0.466127 / 0.540337 (-0.074211)	0.551052 / 1.386936 (-0.835884)

use default audio resampling type

47ab08d

lhoestq mentioned this pull request Feb 21, 2023

Add resampy dep #5554

Closed

style

6ab909a

polinaeterna approved these changes Feb 21, 2023

lhoestq merged commit 4a767f7 into main Feb 21, 2023

lhoestq deleted the use-default-audio-res_type branch February 21, 2023 12:42

AJDERS pushed a commit to AJDERS/datasets that referenced this pull request Feb 21, 2023

Use default audio resampling type (huggingface#5556)

ea530d7

* use default audio resampling type * style

AJDERS added a commit to AJDERS/datasets that referenced this pull request Feb 21, 2023

Revert "Use default audio resampling type (huggingface#5556)"

abb6e61

This reverts commit ea530d7.
