
Skip dataset verifications by default #5303

Merged: 18 commits from the skip-verifications branch into main on Feb 13, 2023

Conversation

mariosasko
Collaborator

Skip the dataset verifications (split and checksum verifications, duplicate keys check) by default unless a dataset is being tested (datasets-cli test/run_beam). The main goal is to avoid running the checksum check in the default case due to how expensive it can be for large datasets.

PS: Maybe we should deprecate ignore_verifications, which is True now by default, and give it a different name?
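
For illustration, a minimal sketch of the change from a user's perspective (the dataset name is just a placeholder):

from datasets import load_dataset

# Before this PR: verifications run on every load, and users have to opt out explicitly
ds = load_dataset("some_dataset", ignore_verifications=True)

# After this PR: the expensive checksum check is skipped by default, so a plain call is enough
ds = load_dataset("some_dataset")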

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Nov 25, 2022

The documentation is not available anymore as the PR was closed or merged.

@lhoestq
Member

lhoestq commented Dec 7, 2022

100% agree that the checksum verification is overkill and not super useful. But I think this PR would also disable the check on num_examples, no?

As a user I would like to know if the dataset I'm loading changed significantly.
And I also think it can be useful to make sure the metadata are up to date.

What do you think?

We could have a default ignore_verifications="ignore_checksums"
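
Something along these lines, just to sketch the idea (the signature below is illustrative, not actual code):

from typing import Union

# Hypothetical: ignore_verifications accepts True / False / "ignore_checksums"
def load_dataset(path: str, ignore_verifications: Union[bool, str] = "ignore_checksums", **kwargs):
    skip_checksums = ignore_verifications in (True, "ignore_checksums")  # skip the expensive checksum check by default
    skip_all_checks = ignore_verifications is True                       # also skip split/num_examples checks
    ...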

@mariosasko
Collaborator Author

We could have a default ignore_verifications="ignore_checksums"

Accepting multiple types (booleans and strings) at the same time is not the best design. Maybe we could define an enum for this parameter?

@lhoestq
Member

lhoestq commented Dec 7, 2022

Yes, an enum sounds good!

@polinaeterna
Contributor

So we could have three verification levels, something like "ignore_all" (to skip checksums and all other info verification such as num_examples), "ignore_checksums" (to skip only the checksum verification), and "verify_all" (to perform all verifications)? And deprecate the ignore_verifications param.

@mariosasko if you're not going to work on this PR in the coming days, I can take it over if you want (this PR will help me with this issue, not super urgent though).

@mariosasko
Collaborator Author

Okay, I propose deprecating ignore_verifications in favor of verification_mode (load_dataset already has download_mode; some other projects use this name for verification control). verification_mode would accept the following enum (or strings in the same manner as download_mode does):

class VerificationMode(enum.Enum):
    FULL = "full"    # runs all verification checks
    BASIC = "basic"  # default; runs only the cheap checks (skips the checksum check)
    NONE = "none"    # skips all checks

WDYT?
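
Usage would then look something like this (assuming the enum gets exposed at the package level, like DownloadMode):

from datasets import load_dataset
from datasets import VerificationMode  # assumption: exported alongside DownloadMode

ds = load_dataset("some_dataset", verification_mode="full")                  # as a string
ds = load_dataset("some_dataset", verification_mode=VerificationMode.BASIC)  # or as an enum member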

@lhoestq
Member

lhoestq commented Jan 24, 2023

(copy paste from my message on slack)

What do you think of a config variable in config.py to switch from one verification mode to another? This way we don't deprecate anything.

Many users are familiar with ignore_verifications=True; it might be overkill to deprecate it.

@polinaeterna
Contributor

polinaeterna commented Jan 25, 2023

@lhoestq So we'd have a "basic" verification mode in config.py and keep False as the default value for ignore_verifications? That way, running all verifications including checksums would not be possible without switching the config var, right?

I like having a VerificationMode enum because it's aligned with DownloadMode and sounds more natural to me (ignore_verifications feels a bit semantically inverted, but this is probably just my feeling), and it's flexible (no need to worry about config.py; I'm not sure users even know it exists, wdyt?).

The point about existing usage also seems valid to me, but users do get stuck on NonMatchingX errors from time to time, and figuring out what's wrong is non-trivial.

As a side note, I suggest adding instructions to the NonMatchingX error message (how to use ignore_verifications / verification_mode); this would save a lot of time for users who don't know about this param.
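
For example, something roughly like this (purely illustrative, not part of this PR):

from datasets.utils.info_utils import NonMatchingChecksumError

raise NonMatchingChecksumError(
    "Checksums didn't match for dataset source files. "
    "If you trust the source, pass ignore_verifications=True (or the new verification_mode parameter) "
    "to load_dataset to skip this check."
)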

@lhoestq
Member

lhoestq commented Jan 25, 2023

Ok I see. I'm fine with the new parameter then (even though I had a small pref for the config variable) :)

@albertvillanova
Member

I like the idea of an enum and the verification_mode parameter.

In relation to the config parameter, we could additionally add a DEFAULT_VERIFICATION_MODE, maybe only if users require it. Note that until now there wasn't any config parameter for a default ignore_verifications value: I guess people just pass ignore_verifications=True explicitly...
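
Roughly, such a default could look like this (names and wiring are just a sketch of the idea, not something implemented in this PR):

# in datasets/config.py (hypothetical new constant):
DEFAULT_VERIFICATION_MODE = "basic"

# in builder.py, the parameter could then fall back to the configured default
# instead of a hard-coded VerificationMode.BASIC
# (assuming VerificationMode is importable from datasets.utils.info_utils):
verification_mode = VerificationMode(verification_mode or config.DEFAULT_VERIFICATION_MODE)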

As a note aside, I like the suggestion by @polinaeterna: we could give actionable messages when verifying checksums. This could be done in another PR.

@github-actions
Show benchmarks

Automated benchmark results for PyArrow==6.0.0 and PyArrow==latest (benchmark_array_xd, benchmark_getitem_100B, benchmark_indices_mapping, benchmark_iterating, benchmark_map_filter); full tables collapsed.

@github-actions
Show benchmarks

Automated benchmark results for PyArrow==6.0.0 and PyArrow==latest (benchmark_array_xd, benchmark_getitem_100B, benchmark_indices_mapping, benchmark_iterating, benchmark_map_filter); full tables collapsed.

@mariosasko
Collaborator Author

This is ready for review.

If verification_mode is None, it defaults to VerificationMode.BASIC instead of VerificationMode.NONE, so maybe we should find a better name for the latter to avoid confusion.

PS: ignore_verifications is still present in the test/run_beam commands for simplicity. Let me know if you think these commands should support all three modes.

Member

@lhoestq lhoestq left a comment


Looks all good! Thanks 🙌

src/datasets/builder.py (outdated; resolved)
Contributor

@polinaeterna polinaeterna left a comment


Thank you! I've left a couple of text suggestions and also a couple of questions.

I would also prefer to change the name of the NONE verification mode, but I don't have really good ideas in mind. Maybe something like SKIP_ALL?

src/datasets/builder.py (outdated; resolved)
src/datasets/utils/info_utils.py (outdated; resolved)
src/datasets/utils/info_utils.py (outdated; resolved)
src/datasets/load.py (outdated; resolved)
@@ -724,7 +740,7 @@ def download_and_prepare(
self._output_dir = fs_token_paths[2][0] if is_local else self._fs.unstrip_protocol(fs_token_paths[2][0])

download_mode = DownloadMode(download_mode or DownloadMode.REUSE_DATASET_IF_EXISTS)
- verify_infos = not ignore_verifications
+ verification_mode = VerificationMode(verification_mode or VerificationMode.BASIC)
Contributor


Maybe a stupid question, idk, but why not VerificationMode(verification_mode) or VerificationMode.BASIC (here and everywhere below)?

Collaborator Author


Both approaches work, so I guess this is a matter of style

src/datasets/commands/test.py (outdated; resolved)

| | Verification checks |
|--------------------|------------------------------------------------------------------------------ |
| `FULL` | Split checks, uniqueness of the keys yielded in case of the GeneratorBuilder |
Contributor


Does the uniqueness-of-keys check depend on the verification_mode?

Collaborator Author


Yes, it does

Contributor


can you please point me to the place in the code? I can't find it.... 🙈

Collaborator Author


This is the commit in which this check was introduced: beca084

mariosasko and others added 2 commits February 10, 2023 15:33
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Co-authored-by: Polina Kazakova <polina@huggingface.co>
@mariosasko
Collaborator Author

I would also prefer to change the name of the NONE verification mode, but I don't have really good ideas in mind. Maybe something like SKIP_ALL?

I decided to go with the following names (sketched as an enum below):

  • no_checks (previously none)
  • basic_checks (previously basic)
  • all_checks (previously full)
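
In enum form, that gives roughly:

import enum

class VerificationMode(enum.Enum):
    NO_CHECKS = "no_checks"        # previously NONE: skip all verification checks
    BASIC_CHECKS = "basic_checks"  # previously BASIC (default): run only the cheap checks, no checksums
    ALL_CHECKS = "all_checks"      # previously FULL: run everything, including checksum verification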

Member

@lhoestq lhoestq left a comment


Thank you!

src/datasets/load.py (outdated; resolved)
src/datasets/load.py (outdated; resolved)
src/datasets/utils/info_utils.py (outdated; resolved)
tests/test_builder.py (outdated; resolved)
mariosasko and others added 3 commits February 13, 2023 17:11
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
@github-actions
Show benchmarks

Automated benchmark results for PyArrow==6.0.0 and PyArrow==latest (benchmark_array_xd, benchmark_getitem_100B, benchmark_indices_mapping, benchmark_iterating, benchmark_map_filter); full tables collapsed.

@github-actions
Show benchmarks

Automated benchmark results for PyArrow==6.0.0 and PyArrow==latest (benchmark_array_xd, benchmark_getitem_100B, benchmark_indices_mapping, benchmark_iterating, benchmark_map_filter); full tables collapsed.

@mariosasko mariosasko merged commit cc637d1 into main Feb 13, 2023
@mariosasko mariosasko deleted the skip-verifications branch February 13, 2023 16:43
@github-actions
Show benchmarks

Automated benchmark results for PyArrow==6.0.0 and PyArrow==latest (benchmark_array_xd, benchmark_getitem_100B, benchmark_indices_mapping, benchmark_iterating, benchmark_map_filter); full tables collapsed.

AJDERS pushed a commit to AJDERS/datasets that referenced this pull request Feb 15, 2023
* Skip dataset verifications by default

* Replace ignore_verifications with VerificationMode

* Update io builders

* Update commands

* Update tests

* Update docs

* Misc

* Fix bad copying in generator io builder

* Update src/datasets/builder.py

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Polina Kazakova <polina@huggingface.co>

* Rename values

* Apply suggestions from code review

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* Style

---------

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Co-authored-by: Polina Kazakova <polina@huggingface.co>
filip-halt pushed a commit to filip-halt/datasets that referenced this pull request Feb 16, 2023