Add type stub
SBrandeis committed May 25, 2021
1 parent 633ddcd commit 8743224
Showing 2 changed files with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion setup.py
@@ -216,7 +216,7 @@
 license="Apache 2.0",
 package_dir={"": "src"},
 packages=find_packages("src"),
-package_data={"datasets": ["scripts/templates/*"], "datasets.utils.resources": ["*.json", "*.yaml"]},
+package_data={"datasets": ["scripts/templates/*", "py.typed"], "datasets.utils.resources": ["*.json", "*.yaml"]},
 entry_points={"console_scripts": ["datasets-cli=datasets.commands.datasets_cli:main"]},
 install_requires=REQUIRED_PKGS,
 extras_require=EXTRAS_REQUIRE,
Empty file added src/datasets/py.typed
Empty file.
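
For context, an empty `py.typed` file is the marker that PEP 561 uses to signal that a package ships inline type information, and listing it in `package_data` is what gets it included in built distributions. The sketch below is one way to check whether an importable package carries that marker. The helper name `ships_type_marker` is hypothetical and not part of this commit or of `datasets`.

```python
import importlib.util
from pathlib import Path


def ships_type_marker(package_name: str) -> bool:
    """Return True if the importable top-level package contains a py.typed file.

    Hypothetical helper for illustration; only the standard library is used.
    """
    spec = importlib.util.find_spec(package_name)
    # Plain modules and missing packages have no package directory to inspect.
    if spec is None or not spec.submodule_search_locations:
        return False
    # A package may span several directories; any one holding the marker counts.
    return any(
        (Path(location) / "py.typed").is_file()
        for location in spec.submodule_search_locations
    )
```

Calling `ships_type_marker("datasets")` would be expected to return `True` only for an installation built from a revision that includes this change.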

1 comment on commit 8743224

@github-actions

PyArrow==1.0.0

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.026189 / 0.011353 (0.014836) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.019414 / 0.011008 (0.008405) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.052683 / 0.038508 (0.014175) |
| read_batch_unformated after write_array2d | 0.040980 / 0.023109 (0.017870) |
| read_batch_unformated after write_flattened_sequence | 0.372202 / 0.275898 (0.096303) |
| read_batch_unformated after write_nested_sequence | 0.386760 / 0.323480 (0.063280) |
| read_col_formatted_as_numpy after write_array2d | 0.012675 / 0.007986 (0.004689) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005815 / 0.004328 (0.001486) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.012810 / 0.004250 (0.008560) |
| read_col_unformated after write_array2d | 0.052053 / 0.037052 (0.015001) |
| read_col_unformated after write_flattened_sequence | 0.381498 / 0.258489 (0.123009) |
| read_col_unformated after write_nested_sequence | 0.426056 / 0.293841 (0.132215) |
| read_formatted_as_numpy after write_array2d | 0.186900 / 0.128546 (0.058354) |
| read_formatted_as_numpy after write_flattened_sequence | 0.149690 / 0.075646 (0.074044) |
| read_formatted_as_numpy after write_nested_sequence | 0.492781 / 0.419271 (0.073509) |
| read_unformated after write_array2d | 0.460093 / 0.043533 (0.416560) |
| read_unformated after write_flattened_sequence | 0.396020 / 0.255139 (0.140881) |
| read_unformated after write_nested_sequence | 0.464141 / 0.283200 (0.180941) |
| write_array2d | 1.848943 / 0.141683 (1.707261) |
| write_flattened_sequence | 1.913901 / 1.452155 (0.461746) |
| write_nested_sequence | 2.004864 / 1.492716 (0.512148) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.009612 / 0.018006 (-0.008394) |
| get_batch_of_1024_rows | 0.578322 / 0.000490 (0.577832) |
| get_first_row | 0.000293 / 0.000200 (0.000093) |
| get_last_row | 0.000066 / 0.000054 (0.000012) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.049097 / 0.037411 (0.011686) |
| shard | 0.029905 / 0.014526 (0.015379) |
| shuffle | 0.031924 / 0.176557 (-0.144632) |
| sort | 0.053297 / 0.737135 (-0.683838) |
| train_test_split | 0.032552 / 0.296338 (-0.263787) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.530236 / 0.215209 (0.315027) |
| read 50000 | 5.378895 / 2.077655 (3.301241) |
| read_batch 50000 10 | 2.541967 / 1.504120 (1.037847) |
| read_batch 50000 100 | 2.197788 / 1.541195 (0.656593) |
| read_batch 50000 1000 | 2.244782 / 1.468490 (0.776292) |
| read_formatted numpy 5000 | 8.055904 / 4.584777 (3.471127) |
| read_formatted pandas 5000 | 7.349002 / 3.745712 (3.603289) |
| read_formatted tensorflow 5000 | 10.152408 / 5.269862 (4.882547) |
| read_formatted torch 5000 | 9.159328 / 4.565676 (4.593651) |
| read_formatted_batch numpy 5000 10 | 0.790237 / 0.424275 (0.365962) |
| read_formatted_batch numpy 5000 1000 | 0.012808 / 0.007607 (0.005201) |
| shuffled read 5000 | 0.666394 / 0.226044 (0.440350) |
| shuffled read 50000 | 6.897102 / 2.268929 (4.628173) |
| shuffled read_batch 50000 10 | 3.214011 / 55.444624 (-52.230613) |
| shuffled read_batch 50000 100 | 2.592196 / 6.876477 (-4.284281) |
| shuffled read_batch 50000 1000 | 2.705511 / 2.142072 (0.563439) |
| shuffled read_formatted numpy 5000 | 8.159321 / 4.805227 (3.354093) |
| shuffled read_formatted_batch numpy 5000 10 | 6.968478 / 6.500664 (0.467814) |
| shuffled read_formatted_batch numpy 5000 1000 | 7.388899 / 0.075469 (7.313430) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 13.337259 / 1.841788 (11.495472) |
| map fast-tokenizer batched | 14.594259 / 8.074308 (6.519950) |
| map identity | 42.183136 / 10.191392 (31.991744) |
| map identity batched | 1.051624 / 0.680424 (0.371200) |
| map no-op batched | 0.708242 / 0.534201 (0.174041) |
| map no-op batched numpy | 0.905551 / 0.579283 (0.326268) |
| map no-op batched pandas | 0.714386 / 0.434364 (0.280022) |
| map no-op batched pytorch | 0.836555 / 0.540337 (0.296217) |
| map no-op batched tensorflow | 1.860102 / 1.386936 (0.473166) |

PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.028044 / 0.011353 (0.016691) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.017827 / 0.011008 (0.006819) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.056426 / 0.038508 (0.017918) |
| read_batch_unformated after write_array2d | 0.042688 / 0.023109 (0.019579) |
| read_batch_unformated after write_flattened_sequence | 0.370461 / 0.275898 (0.094563) |
| read_batch_unformated after write_nested_sequence | 0.401526 / 0.323480 (0.078046) |
| read_col_formatted_as_numpy after write_array2d | 0.013363 / 0.007986 (0.005377) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.006006 / 0.004328 (0.001678) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.013079 / 0.004250 (0.008828) |
| read_col_unformated after write_array2d | 0.064562 / 0.037052 (0.027510) |
| read_col_unformated after write_flattened_sequence | 0.379967 / 0.258489 (0.121478) |
| read_col_unformated after write_nested_sequence | 0.420302 / 0.293841 (0.126461) |
| read_formatted_as_numpy after write_array2d | 0.190987 / 0.128546 (0.062441) |
| read_formatted_as_numpy after write_flattened_sequence | 0.155774 / 0.075646 (0.080127) |
| read_formatted_as_numpy after write_nested_sequence | 0.503711 / 0.419271 (0.084440) |
| read_unformated after write_array2d | 0.486126 / 0.043533 (0.442593) |
| read_unformated after write_flattened_sequence | 0.370384 / 0.255139 (0.115245) |
| read_unformated after write_nested_sequence | 0.403741 / 0.283200 (0.120541) |
| write_array2d | 1.890361 / 0.141683 (1.748679) |
| write_flattened_sequence | 2.043832 / 1.452155 (0.591677) |
| write_nested_sequence | 2.084745 / 1.492716 (0.592029) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.011185 / 0.018006 (-0.006821) |
| get_batch_of_1024_rows | 0.575827 / 0.000490 (0.575338) |
| get_first_row | 0.000468 / 0.000200 (0.000268) |
| get_last_row | 0.000064 / 0.000054 (0.000010) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.045560 / 0.037411 (0.008148) |
| shard | 0.029460 / 0.014526 (0.014934) |
| shuffle | 0.030908 / 0.176557 (-0.145649) |
| sort | 0.051524 / 0.737135 (-0.685612) |
| train_test_split | 0.035028 / 0.296338 (-0.261310) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.501250 / 0.215209 (0.286041) |
| read 50000 | 5.041439 / 2.077655 (2.963784) |
| read_batch 50000 10 | 2.373683 / 1.504120 (0.869563) |
| read_batch 50000 100 | 2.027450 / 1.541195 (0.486255) |
| read_batch 50000 1000 | 1.966869 / 1.468490 (0.498379) |
| read_formatted numpy 5000 | 7.746145 / 4.584777 (3.161368) |
| read_formatted pandas 5000 | 6.900830 / 3.745712 (3.155118) |
| read_formatted tensorflow 5000 | 9.625888 / 5.269862 (4.356027) |
| read_formatted torch 5000 | 8.703651 / 4.565676 (4.137975) |
| read_formatted_batch numpy 5000 10 | 0.801127 / 0.424275 (0.376851) |
| read_formatted_batch numpy 5000 1000 | 0.011944 / 0.007607 (0.004337) |
| shuffled read 5000 | 0.670734 / 0.226044 (0.444689) |
| shuffled read 50000 | 6.529807 / 2.268929 (4.260879) |
| shuffled read_batch 50000 10 | 3.024174 / 55.444624 (-52.420451) |
| shuffled read_batch 50000 100 | 2.394399 / 6.876477 (-4.482077) |
| shuffled read_batch 50000 1000 | 2.543008 / 2.142072 (0.400935) |
| shuffled read_formatted numpy 5000 | 8.093930 / 4.805227 (3.288702) |
| shuffled read_formatted_batch numpy 5000 10 | 6.432153 / 6.500664 (-0.068511) |
| shuffled read_formatted_batch numpy 5000 1000 | 8.597261 / 0.075469 (8.521792) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 13.008618 / 1.841788 (11.166830) |
| map fast-tokenizer batched | 14.748005 / 8.074308 (6.673697) |
| map identity | 41.592620 / 10.191392 (31.401228) |
| map identity batched | 0.971684 / 0.680424 (0.291260) |
| map no-op batched | 0.655306 / 0.534201 (0.121105) |
| map no-op batched numpy | 0.866066 / 0.579283 (0.286783) |
| map no-op batched pandas | 0.705935 / 0.434364 (0.271571) |
| map no-op batched pytorch | 0.819791 / 0.540337 (0.279454) |
| map no-op batched tensorflow | 1.748611 / 1.386936 (0.361675) |