Commit

minor
lhoestq committed Apr 23, 2021
1 parent a877fff commit 3bd47e5
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion src/datasets/arrow_dataset.py
@@ -2863,7 +2863,7 @@ def add_elasticsearch_index(
     def add_item(self, item: dict, new_fingerprint: str):
         """Add item to Dataset.
-        .. versionadded:: 1.6
+        .. versionadded:: 1.7
         Args:
             item (dict): Item data to be added.

1 comment on commit 3bd47e5

@github-actions

PyArrow==1.0.0

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.023291 / 0.011353 (0.011938) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.016937 / 0.011008 (0.005928) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.053417 / 0.038508 (0.014908) |
| read_batch_unformated after write_array2d | 0.037130 / 0.023109 (0.014021) |
| read_batch_unformated after write_flattened_sequence | 0.340464 / 0.275898 (0.064566) |
| read_batch_unformated after write_nested_sequence | 0.379185 / 0.323480 (0.055705) |
| read_col_formatted_as_numpy after write_array2d | 0.011118 / 0.007986 (0.003132) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005346 / 0.004328 (0.001017) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.010406 / 0.004250 (0.006156) |
| read_col_unformated after write_array2d | 0.047169 / 0.037052 (0.010117) |
| read_col_unformated after write_flattened_sequence | 0.332526 / 0.258489 (0.074037) |
| read_col_unformated after write_nested_sequence | 0.394118 / 0.293841 (0.100277) |
| read_formatted_as_numpy after write_array2d | 0.167362 / 0.128546 (0.038816) |
| read_formatted_as_numpy after write_flattened_sequence | 0.150002 / 0.075646 (0.074356) |
| read_formatted_as_numpy after write_nested_sequence | 0.436307 / 0.419271 (0.017036) |
| read_unformated after write_array2d | 0.449345 / 0.043533 (0.405812) |
| read_unformated after write_flattened_sequence | 0.394352 / 0.255139 (0.139213) |
| read_unformated after write_nested_sequence | 0.379626 / 0.283200 (0.096427) |
| write_array2d | 1.666113 / 0.141683 (1.524430) |
| write_flattened_sequence | 1.912357 / 1.452155 (0.460202) |
| write_nested_sequence | 1.855004 / 1.492716 (0.362288) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.017017 / 0.018006 (-0.000989) |
| get_batch_of_1024_rows | 0.000447 / 0.000490 (-0.000042) |
| get_first_row | 0.000192 / 0.000200 (-0.000008) |
| get_last_row | 0.000051 / 0.000054 (-0.000003) |
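
The benchmark rows report each measurement as `new / old (diff)`. The following is a minimal sketch, not part of the benchmark tooling, of how such a row could be parsed and sanity-checked. It assumes that `diff` is `new - old`, which is consistent with the figures shown, and that the diffs were computed before rounding, so a small tolerance is needed. The names `parse_row` and `check_row` are hypothetical.

```python
import re

# Matches one cell such as "0.017017 / 0.018006 (-0.000989)".
CELL = re.compile(r"(-?\d+\.\d+) / (-?\d+\.\d+) \((-?\d+\.\d+)\)")


def parse_row(row: str):
    """Return a list of (new, old, diff) float tuples found in a benchmark row."""
    return [tuple(float(x) for x in m.groups()) for m in CELL.finditer(row)]


def check_row(row: str, tol: float = 2e-6) -> bool:
    """Return True if every reported diff equals new - old within `tol`.

    The tolerance absorbs rounding: the printed values have six decimals,
    and the printed diff can differ from (new - old) by one unit in the
    last place.
    """
    return all(abs((new - old) - diff) <= tol for new, old, diff in parse_row(row))


# Two cells copied from the benchmark_getitem_100B.json row above.
row = "0.017017 / 0.018006 (-0.000989) 0.000447 / 0.000490 (-0.000042)"
cells = parse_row(row)
row_is_consistent = check_row(row)
```

A negative diff means the new run was faster than the old one for that metric, under the assumption stated above.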

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.044164 / 0.037411 (0.006753) |
| shard | 0.027984 / 0.014526 (0.013458) |
| shuffle | 0.035193 / 0.176557 (-0.141364) |
| sort | 0.047079 / 0.737135 (-0.690057) |
| train_test_split | 0.034213 / 0.296338 (-0.262125) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.497527 / 0.215209 (0.282318) |
| read 50000 | 5.086624 / 2.077655 (3.008969) |
| read_batch 50000 10 | 2.266212 / 1.504120 (0.762092) |
| read_batch 50000 100 | 1.971555 / 1.541195 (0.430360) |
| read_batch 50000 1000 | 2.007457 / 1.468490 (0.538967) |
| read_formatted numpy 5000 | 7.754794 / 4.584777 (3.170017) |
| read_formatted pandas 5000 | 6.854906 / 3.745712 (3.109194) |
| read_formatted tensorflow 5000 | 9.447678 / 5.269862 (4.177816) |
| read_formatted torch 5000 | 7.810007 / 4.565676 (3.244330) |
| read_formatted_batch numpy 5000 10 | 0.769416 / 0.424275 (0.345140) |
| read_formatted_batch numpy 5000 1000 | 0.011603 / 0.007607 (0.003996) |
| shuffled read 5000 | 0.664535 / 0.226044 (0.438491) |
| shuffled read 50000 | 5.786130 / 2.268929 (3.517201) |
| shuffled read_batch 50000 10 | 2.992089 / 55.444624 (-52.452535) |
| shuffled read_batch 50000 100 | 2.641485 / 6.876477 (-4.234991) |
| shuffled read_batch 50000 1000 | 2.589287 / 2.142072 (0.447215) |
| shuffled read_formatted numpy 5000 | 7.282878 / 4.805227 (2.477650) |
| shuffled read_formatted_batch numpy 5000 10 | 7.095274 / 6.500664 (0.594609) |
| shuffled read_formatted_batch numpy 5000 1000 | 8.127791 / 0.075469 (8.052322) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 12.297712 / 1.841788 (10.455925) |
| map fast-tokenizer batched | 14.917005 / 8.074308 (6.842697) |
| map identity | 36.947481 / 10.191392 (26.756089) |
| map identity batched | 0.955028 / 0.680424 (0.274604) |
| map no-op batched | 0.663083 / 0.534201 (0.128882) |
| map no-op batched numpy | 0.910737 / 0.579283 (0.331454) |
| map no-op batched pandas | 0.709802 / 0.434364 (0.275438) |
| map no-op batched pytorch | 0.826266 / 0.540337 (0.285929) |
| map no-op batched tensorflow | 1.672308 / 1.386936 (0.285371) |

PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.023565 / 0.011353 (0.012212) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.016096 / 0.011008 (0.005088) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.055838 / 0.038508 (0.017330) |
| read_batch_unformated after write_array2d | 0.032574 / 0.023109 (0.009465) |
| read_batch_unformated after write_flattened_sequence | 0.333384 / 0.275898 (0.057486) |
| read_batch_unformated after write_nested_sequence | 0.400912 / 0.323480 (0.077432) |
| read_col_formatted_as_numpy after write_array2d | 0.010940 / 0.007986 (0.002955) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005334 / 0.004328 (0.001005) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.012037 / 0.004250 (0.007786) |
| read_col_unformated after write_array2d | 0.052322 / 0.037052 (0.015269) |
| read_col_unformated after write_flattened_sequence | 0.326145 / 0.258489 (0.067656) |
| read_col_unformated after write_nested_sequence | 0.415697 / 0.293841 (0.121856) |
| read_formatted_as_numpy after write_array2d | 0.161711 / 0.128546 (0.033164) |
| read_formatted_as_numpy after write_flattened_sequence | 0.165079 / 0.075646 (0.089433) |
| read_formatted_as_numpy after write_nested_sequence | 0.449052 / 0.419271 (0.029780) |
| read_unformated after write_array2d | 0.408106 / 0.043533 (0.364573) |
| read_unformated after write_flattened_sequence | 0.326613 / 0.255139 (0.071474) |
| read_unformated after write_nested_sequence | 0.403282 / 0.283200 (0.120082) |
| write_array2d | 1.637082 / 0.141683 (1.495399) |
| write_flattened_sequence | 1.692463 / 1.452155 (0.240308) |
| write_nested_sequence | 1.737079 / 1.492716 (0.244362) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.019015 / 0.018006 (0.001009) |
| get_batch_of_1024_rows | 0.000458 / 0.000490 (-0.000032) |
| get_first_row | 0.000181 / 0.000200 (-0.000019) |
| get_last_row | 0.000052 / 0.000054 (-0.000002) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.042921 / 0.037411 (0.005510) |
| shard | 0.026583 / 0.014526 (0.012058) |
| shuffle | 0.032388 / 0.176557 (-0.144168) |
| sort | 0.054298 / 0.737135 (-0.682837) |
| train_test_split | 0.033155 / 0.296338 (-0.263183) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.461907 / 0.215209 (0.246698) |
| read 50000 | 4.831854 / 2.077655 (2.754199) |
| read_batch 50000 10 | 2.270505 / 1.504120 (0.766385) |
| read_batch 50000 100 | 1.917343 / 1.541195 (0.376149) |
| read_batch 50000 1000 | 2.066533 / 1.468490 (0.598043) |
| read_formatted numpy 5000 | 7.352465 / 4.584777 (2.767688) |
| read_formatted pandas 5000 | 6.533295 / 3.745712 (2.787583) |
| read_formatted tensorflow 5000 | 9.075233 / 5.269862 (3.805371) |
| read_formatted torch 5000 | 8.006533 / 4.565676 (3.440856) |
| read_formatted_batch numpy 5000 10 | 0.662079 / 0.424275 (0.237804) |
| read_formatted_batch numpy 5000 1000 | 0.010258 / 0.007607 (0.002651) |
| shuffled read 5000 | 0.541813 / 0.226044 (0.315768) |
| shuffled read 50000 | 5.700716 / 2.268929 (3.431787) |
| shuffled read_batch 50000 10 | 2.992220 / 55.444624 (-52.452404) |
| shuffled read_batch 50000 100 | 2.555544 / 6.876477 (-4.320933) |
| shuffled read_batch 50000 1000 | 2.678521 / 2.142072 (0.536449) |
| shuffled read_formatted numpy 5000 | 7.332535 / 4.805227 (2.527307) |
| shuffled read_formatted_batch numpy 5000 10 | 6.469089 / 6.500664 (-0.031575) |
| shuffled read_formatted_batch numpy 5000 1000 | 10.914750 / 0.075469 (10.839281) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 11.042093 / 1.841788 (9.200305) |
| map fast-tokenizer batched | 12.683185 / 8.074308 (4.608877) |
| map identity | 37.748799 / 10.191392 (27.557406) |
| map identity batched | 0.810524 / 0.680424 (0.130100) |
| map no-op batched | 0.612285 / 0.534201 (0.078084) |
| map no-op batched numpy | 0.805593 / 0.579283 (0.226310) |
| map no-op batched pandas | 0.670978 / 0.434364 (0.236614) |
| map no-op batched pytorch | 0.791104 / 0.540337 (0.250767) |
| map no-op batched tensorflow | 1.586753 / 1.386936 (0.199817) |
