
Skip dataset verifications by default #5303

Merged: 18 commits from the skip-verifications branch into main on Feb 13, 2023

Conversation

mariosasko
Collaborator

Skip the dataset verifications (split and checksum verifications, duplicate keys check) by default unless a dataset is being tested (datasets-cli test/run_beam). The main goal is to avoid running the checksum check in the default case due to how expensive it can be for large datasets.

PS: Maybe we should deprecate ignore_verifications, which is True now by default, and give it a different name?
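
For illustration, a minimal sketch of the change from a user's perspective (the dataset name is just a placeholder):

from datasets import load_dataset

# Before this PR: verifications run on every load, and users have to opt out explicitly
ds = load_dataset("some_dataset", ignore_verifications=True)

# After this PR: the expensive checksum check is skipped by default, so a plain call is enough
ds = load_dataset("some_dataset")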

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Nov 25, 2022

The documentation is not available anymore as the PR was closed or merged.

@lhoestq
Member

lhoestq commented Dec 7, 2022

100% agree that the checksum verification is overkill and not super useful. But I think this PR would also disable the check on num_examples, no?

As a user I would like to know if the dataset I'm loading changed significantly.
And I also think it can be useful to make sure the metadata are up to date.

What do you think?

We could have a default ignore_verifications="ignore_checksums"
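
Something along these lines, just to sketch the idea (the signature below is illustrative, not actual code):

from typing import Union

# Hypothetical: ignore_verifications accepts True / False / "ignore_checksums"
def load_dataset(path: str, ignore_verifications: Union[bool, str] = "ignore_checksums", **kwargs):
    skip_checksums = ignore_verifications in (True, "ignore_checksums")  # skip the expensive checksum check by default
    skip_all_checks = ignore_verifications is True                       # also skip split/num_examples checks
    ...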

@mariosasko
Collaborator Author

We could have a default ignore_verifications="ignore_checksums"

Accepting multiple types (booleans and strings) at the same time is not the best design. Maybe we could define an enum for this parameter?

@lhoestq
Member

lhoestq commented Dec 7, 2022

Yes, an enum sounds good!

@polinaeterna
Contributor

So we could have three verification levels, something like "ignore_all" (to skip checksums and all other info verification such as num_examples), "ignore_checksums" (to skip only the checksum verification), and "verify_all" (to perform all verifications)? And deprecate the ignore_verifications param.

@mariosasko if you're not going to work on this PR in the coming days, I can take it over if you want (this PR will help me with this issue, not super urgent though).

@mariosasko
Collaborator Author

Okay, I propose deprecating ignore_verifications in favor of verification_mode (load_dataset already has download_mode; some other projects use this name for verification control). verification_mode would accept the following enum (or strings in the same manner as download_mode does):

class VerificationMode(enum.Enum):
    FULL = "full"    # runs all verification checks
    BASIC = "basic"  # default; runs only the cheap checks (skips the checksum check)
    NONE = "none"    # skips all checks

WDYT?
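
Usage would then look something like this (assuming the enum gets exposed at the package level, like DownloadMode):

from datasets import load_dataset
from datasets import VerificationMode  # assumption: exported alongside DownloadMode

ds = load_dataset("some_dataset", verification_mode="full")                  # as a string
ds = load_dataset("some_dataset", verification_mode=VerificationMode.BASIC)  # or as an enum member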

@lhoestq
Member

lhoestq commented Jan 24, 2023

(copy paste from my message on slack)

What do you think of a config variable in config.py to switch from one verification mode to another? This way we don't deprecate anything.

Many users are familiar with ignore_verifications=True; it might be overkill to deprecate it.

@polinaeterna
Contributor

polinaeterna commented Jan 25, 2023

@lhoestq So we'd have a "basic" verification mode in config.py and keep False as the default value for ignore_verifications? That way, running all verifications including checksums would not be possible without switching the config var, right?

I like having a VerificationMode enum because it's aligned with DownloadMode and sounds more natural to me (ignore_verifications feels a bit semantically inverted, but this is probably just my feeling), and it's flexible (no need to worry about config.py; I'm not sure users even know it exists, wdyt?).

The point about existing usage also seems valid to me, but users do get stuck on NonMatchingX errors from time to time, and figuring out what's wrong is non-trivial.

As a side note, I suggest adding instructions to the NonMatchingX error message (how to use ignore_verifications / verification_mode); this would save a lot of time for users who don't know about this param.
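
For example, something roughly like this (purely illustrative, not part of this PR):

from datasets.utils.info_utils import NonMatchingChecksumError

raise NonMatchingChecksumError(
    "Checksums didn't match for dataset source files. "
    "If you trust the source, pass ignore_verifications=True (or the new verification_mode parameter) "
    "to load_dataset to skip this check."
)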

@lhoestq
Member

lhoestq commented Jan 25, 2023

Ok I see. I'm fine with the new parameter then (even though I had a small pref for the config variable) :)

@albertvillanova
Member

I like the idea of an enum and the verification_mode parameter.

In relation to the config parameter, we could additionally add a DEFAULT_VERIFICATION_MODE, maybe only if users require it. Note that until now there wasn't any config parameter for a default ignore_verifications value: I guess people just pass ignore_verifications=True explicitly...
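
Roughly, such a default could look like this (names and wiring are just a sketch of the idea, not something implemented in this PR):

# in datasets/config.py (hypothetical new constant):
DEFAULT_VERIFICATION_MODE = "basic"

# in builder.py, the parameter could then fall back to the configured default
# instead of a hard-coded VerificationMode.BASIC
# (assuming VerificationMode is importable from datasets.utils.info_utils):
verification_mode = VerificationMode(verification_mode or config.DEFAULT_VERIFICATION_MODE)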

As a note aside, I like the suggestion by @polinaeterna: we could give actionable messages when verifying checksums. This could be done in another PR.

@github-actions
Show benchmarks

Automated benchmark results for PyArrow==6.0.0 and PyArrow==latest (benchmark_array_xd, benchmark_getitem_100B, benchmark_indices_mapping, benchmark_iterating, benchmark_map_filter); full tables collapsed.

@github-actions
Show benchmarks

Automated benchmark results for PyArrow==6.0.0 and PyArrow==latest (benchmark_array_xd, benchmark_getitem_100B, benchmark_indices_mapping, benchmark_iterating, benchmark_map_filter); full tables collapsed.

@mariosasko
Collaborator Author

This is ready for review.

If verification_mode is None, it defaults to VerificationMode.BASIC instead of VerificationMode.NONE, so maybe we should find a better name for the latter to avoid confusion.

PS: ignore_verifications is still present in the test/run_beam commands for simplicity. Let me know if you think these commands should support all three modes.

Member

@lhoestq lhoestq left a comment


Looks all good! Thanks 🙌

src/datasets/builder.py (outdated; resolved)
Contributor

@polinaeterna polinaeterna left a comment


Thank you! I've left a couple of text suggestions and also a couple of questions.

I would also prefer to change the name of the NONE verification mode, but I don't have really good ideas in mind. Maybe something like SKIP_ALL?

src/datasets/builder.py (outdated; resolved)
src/datasets/utils/info_utils.py (outdated; resolved)
src/datasets/utils/info_utils.py (outdated; resolved)
src/datasets/load.py (outdated; resolved)
@@ -724,7 +740,7 @@ def download_and_prepare(
self._output_dir = fs_token_paths[2][0] if is_local else self._fs.unstrip_protocol(fs_token_paths[2][0])

download_mode = DownloadMode(download_mode or DownloadMode.REUSE_DATASET_IF_EXISTS)
- verify_infos = not ignore_verifications
+ verification_mode = VerificationMode(verification_mode or VerificationMode.BASIC)
Contributor


Maybe a stupid question, idk, but why not VerificationMode(verification_mode) or VerificationMode.BASIC (here and everywhere below)?

Collaborator Author


Both approaches work, so I guess this is a matter of style

src/datasets/commands/test.py (outdated; resolved)

| | Verification checks |
|--------------------|------------------------------------------------------------------------------ |
| `FULL` | Split checks, uniqueness of the keys yielded in case of the GeneratorBuilder |
Contributor


Does the uniqueness-of-keys check depend on the verification_mode?

Collaborator Author


Yes, it does

Contributor


can you please point me to the place in the code? I can't find it.... 🙈

Collaborator Author


This is the commit in which this check was introduced: beca084

mariosasko and others added 2 commits February 10, 2023 15:33
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Co-authored-by: Polina Kazakova <polina@huggingface.co>
@mariosasko
Collaborator Author

I would also prefer to change the name of the NONE verification mode, but I don't have really good ideas in mind. Maybe something like SKIP_ALL?

I decided to go with the following names (sketched as an enum below):

  • no_checks (previously none)
  • basic_checks (previously basic)
  • all_checks (previously full)
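
In enum form, that gives roughly:

import enum

class VerificationMode(enum.Enum):
    NO_CHECKS = "no_checks"        # previously NONE: skip all verification checks
    BASIC_CHECKS = "basic_checks"  # previously BASIC (default): run only the cheap checks, no checksums
    ALL_CHECKS = "all_checks"      # previously FULL: run everything, including checksum verification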

Member

@lhoestq lhoestq left a comment


Thank you!

src/datasets/load.py (outdated; resolved)
src/datasets/load.py (outdated; resolved)
src/datasets/utils/info_utils.py (outdated; resolved)
tests/test_builder.py (outdated; resolved)
mariosasko and others added 3 commits February 13, 2023 17:11
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
@github-actions
Show benchmarks

Automated benchmark results for PyArrow==6.0.0 and PyArrow==latest (benchmark_array_xd, benchmark_getitem_100B, benchmark_indices_mapping, benchmark_iterating, benchmark_map_filter); full tables collapsed.

@github-actions
Show benchmarks

Automated benchmark results for PyArrow==6.0.0 and PyArrow==latest (benchmark_array_xd, benchmark_getitem_100B, benchmark_indices_mapping, benchmark_iterating, benchmark_map_filter); full tables collapsed.

@mariosasko mariosasko merged commit cc637d1 into main Feb 13, 2023
@mariosasko mariosasko deleted the skip-verifications branch February 13, 2023 16:43
@github-actions
Show benchmarks

Automated benchmark results for PyArrow==6.0.0 and PyArrow==latest (benchmark_array_xd, benchmark_getitem_100B, benchmark_indices_mapping, benchmark_iterating, benchmark_map_filter); full tables collapsed.

AJDERS pushed a commit to AJDERS/datasets that referenced this pull request Feb 15, 2023
* Skip dataset verifications by default

* Replace ignore_verifications with VerificationMode

* Update io builders

* Update commands

* Update tests

* Update docs

* Misc

* Fix bad copying in generator io builder

* Update src/datasets/builder.py

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Polina Kazakova <polina@huggingface.co>

* Rename values

* Apply suggestions from code review

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* Style

---------

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Co-authored-by: Polina Kazakova <polina@huggingface.co>
filip-halt pushed a commit to filip-halt/datasets that referenced this pull request Feb 16, 2023