
Commit

Docs maintenance (#3999)
* ✨ doc maintenance

* πŸ–οΈ apply feedback

* πŸ–οΈ use the shorthand syntax
stevhliu committed Mar 30, 2022
1 parent d7a3a76 commit bfb3d09
Showing 11 changed files with 14 additions and 11 deletions.
4 changes: 2 additions & 2 deletions docs/source/access.mdx
Original file line number Diff line number Diff line change
@@ -11,7 +11,7 @@ A [`Dataset`] object is returned when you load an instance of a dataset. This ob

## Metadata

The [`Dataset`] object contains a lot of useful information about your dataset. For example, call [`dataset.info`] to return a short description of the dataset, the authors, and even the dataset size. This will give you a quick snapshot of the dataset's most important attributes.
The [`Dataset`] object contains a lot of useful information about your dataset. For example, access [`DatasetInfo`] to return a short description of the dataset, the authors, and even the dataset size. This will give you a quick snapshot of the dataset's most important attributes.

```py
>>> dataset.info
```

@@ -73,7 +73,7 @@ List the column names with [`Dataset.column_names`]:

```py
['idx', 'label', 'sentence1', 'sentence2']
```

Get detailed information about the columns with [`Dataset.features`]:
Get detailed information about the columns with [`~datasets.Features`]:

```py
>>> dataset.features
```
2 changes: 1 addition & 1 deletion docs/source/dataset_script.mdx
@@ -157,7 +157,7 @@ class SuperGlue(datasets.GeneratorBasedBuilder):

### Default configurations

Users must specify a configuration name when they load a dataset with multiple configurations. Otherwise, πŸ€— Datasets will raise a `ValueError`, and prompt the user to select a configuration name. You can avoid this by setting a default dataset configuration with the ['datasets.DatasetBuilder.DEFAULT_CONFIG_NAME'] attribute:
Users must specify a configuration name when they load a dataset with multiple configurations. Otherwise, πŸ€— Datasets will raise a `ValueError`, and prompt the user to select a configuration name. You can avoid this by setting a default dataset configuration with the `DEFAULT_CONFIG_NAME` attribute:

```py
class NewDataset(datasets.GeneratorBasedBuilder):
```
2 changes: 1 addition & 1 deletion docs/source/filesystems.mdx
@@ -99,7 +99,7 @@ Save your dataset with `botocore.session.Session` and a custom AWS profile:

## Loading datasets

When you are ready to use your dataset again, reload it with `datasets.load_from_disk`:
When you are ready to use your dataset again, reload it with [`Dataset.load_from_disk`]:

```py
>>> from datasets import load_from_disk
```
2 changes: 1 addition & 1 deletion docs/source/load_hub.mdx
@@ -34,7 +34,7 @@ Once you are happy with the dataset you want, load it in a single line with [`lo

Some datasets, like the [General Language Understanding Evaluation (GLUE)](https://huggingface.co/datasets/glue) benchmark, are actually made up of several datasets. These sub-datasets are called **configurations**, and you must explicitly select one when you load the dataset. If you don't provide a configuration name, πŸ€— Datasets will raise a `ValueError` and remind you to select a configuration.

Use `get_dataset_config_names` to retrieve a list of all the possible configurations available to your dataset:
Use the [`get_dataset_config_names`] function to retrieve a list of all the possible configurations available to your dataset:

```py
from datasets import get_dataset_config_names
```
2 changes: 1 addition & 1 deletion docs/source/loading.mdx
@@ -383,7 +383,7 @@ See the [Metrics](./how_to_metrics#custom-metric-loading-script) guide for more

### Load configurations

It is possible for a metric to have different configurations. The configurations are stored in the ['datasets.Metric.config_name'] attribute. When you load a metric, provide the configuration name as shown in the following:
It is possible for a metric to have different configurations. The configurations are stored in the `config_name` attribute of [`MetricInfo`]. When you load a metric, provide the configuration name as shown in the following:

```
>>> from datasets import load_metric
```
4 changes: 2 additions & 2 deletions docs/source/metrics.mdx
@@ -34,7 +34,7 @@ If you are using a benchmark dataset, you need to select a metric that is associ

## Metrics object

Before you begin using a [`Metric`] object, you should get to know it a little better. As with a dataset, you can return some basic information about a metric. For example, use `Metric.inputs_descriptions` to get more information about a metric's expected input format and some usage examples:
Before you begin using a [`Metric`] object, you should get to know it a little better. As with a dataset, you can return some basic information about a metric. For example, access the `inputs_description` parameter in [`datasets.MetricInfo`] to get more information about a metric's expected input format and some usage examples:

```py
>>> print(metric.inputs_description)
```

@@ -71,7 +71,7 @@ Notice for the MRPC configuration, the metric expects the input format to be zer

## Compute metric

Once you have loaded a metric, you are ready to use it to evaluate a model's predictions. Provide the model predictions and references to `Metric.compute`:
Once you have loaded a metric, you are ready to use it to evaluate a model's predictions. Provide the model predictions and references to [`~datasets.Metric.compute`]:

```py
>>> model_predictions = model(model_inputs)
```
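The snippet above is truncated. For an accuracy-style metric, `compute` boils down to this arithmetic (plain Python with toy stand-ins for real model outputs, shown here so no model call is needed):

```python
# Stand-ins for real model predictions and gold labels.
model_predictions = [0, 1, 1, 0]
references = [0, 1, 0, 0]

# Accuracy: fraction of predictions that match the references.
accuracy = sum(p == r for p, r in zip(model_predictions, references)) / len(references)
print(accuracy)  # 0.75
```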
2 changes: 2 additions & 0 deletions docs/source/package_reference/main_classes.mdx
@@ -150,8 +150,10 @@ The base class [`IterableDataset`] implements an iterable Dataset backed by pyth
[[autodoc]] datasets.IterableDataset
- remove_columns
- cast_column
- cast
- __iter__
- map
- rename_column
- filter
- shuffle
- skip
2 changes: 1 addition & 1 deletion docs/source/process.mdx
@@ -322,7 +322,7 @@ You can also use [`Dataset.map`] with indices if you set `with_indices=True`. Th
]
```

You can also use [`Dataset.map`] with the rank of the process if you set `with_rank=True`. This is analogous to `with_indices`. The `rank` argument in the mapped function goes after the `index` one if it is already present. The main use-case for it is to parallelize your computation across several GPUs. This requires setting *multiprocess.set_start_method("spawn")*, without which you will receive a CUDA error: *RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method*.
You can also use [`Dataset.map`] with the rank of the process if you set `with_rank=True`. This is analogous to `with_indices`. The `rank` argument in the mapped function goes after the `index` one if it is already present. The main use-case for it is to parallelize your computation across several GPUs. This requires setting `multiprocess.set_start_method("spawn")`, without which you will receive a CUDA error: `RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method`.


```py
```
2 changes: 1 addition & 1 deletion docs/source/repository_structure.mdx
@@ -3,7 +3,7 @@
To host and share your dataset, you can create a dataset repository on the Hugging Face Dataset Hub and upload your data files.

This guide will show you how to structure your dataset repository when you upload it.
A dataset with a supported structure can be loaded automatically with `load_dataset`, and it will have a preview on its dataset page on the Hub.
A dataset with a supported structure can be loaded automatically with [`~datasets.load_dataset`], and it will have a preview on its dataset page on the Hub.

<Tip>

2 changes: 1 addition & 1 deletion docs/source/stream.mdx
@@ -165,7 +165,7 @@ Casting only works if the original feature type and new feature type are compati

</Tip>

Use [`Dataset.cast_column`] to change the feature type of just one column. Pass the column name and its new feature type as arguments:
Use [`IterableDataset.cast_column`] to change the feature type of just one column. Pass the column name and its new feature type as arguments:

```py
>>> dataset.features
```
1 change: 1 addition & 0 deletions docs/source/use_dataset.mdx
@@ -84,6 +84,7 @@ After you set the format, wrap the dataset with `torch.utils.data.DataLoader`. Y

If you are using TensorFlow, you can use [`Dataset.to_tf_dataset`] to wrap the dataset with a **tf.data.Dataset**, which is natively understood by Keras.
This means a **tf.data.Dataset** object can be iterated over to yield batches of data, and can be passed directly to methods like **model.fit()**.

[`Dataset.to_tf_dataset`] accepts several arguments:

1. `columns` specifies which columns should be formatted (includes the inputs and labels).

1 comment on commit bfb3d09

@github-actions



PyArrow==5.0.0


Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
| --- | --- |
| read_batch_formatted_as_numpy after write_array2d | 0.012457 / 0.011353 (0.001104) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004979 / 0.011008 (-0.006029) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.038140 / 0.038508 (-0.000368) |
| read_batch_unformated after write_array2d | 0.036441 / 0.023109 (0.013332) |
| read_batch_unformated after write_flattened_sequence | 0.381759 / 0.275898 (0.105861) |
| read_batch_unformated after write_nested_sequence | 0.384204 / 0.323480 (0.060724) |
| read_col_formatted_as_numpy after write_array2d | 0.010549 / 0.007986 (0.002564) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005230 / 0.004328 (0.000901) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.010160 / 0.004250 (0.005910) |
| read_col_unformated after write_array2d | 0.042544 / 0.037052 (0.005491) |
| read_col_unformated after write_flattened_sequence | 0.354605 / 0.258489 (0.096116) |
| read_col_unformated after write_nested_sequence | 0.404544 / 0.293841 (0.110703) |
| read_formatted_as_numpy after write_array2d | 0.046306 / 0.128546 (-0.082240) |
| read_formatted_as_numpy after write_flattened_sequence | 0.015473 / 0.075646 (-0.060174) |
| read_formatted_as_numpy after write_nested_sequence | 0.310613 / 0.419271 (-0.108658) |
| read_unformated after write_array2d | 0.063740 / 0.043533 (0.020208) |
| read_unformated after write_flattened_sequence | 0.364119 / 0.255139 (0.108980) |
| read_unformated after write_nested_sequence | 0.405363 / 0.283200 (0.122163) |
| write_array2d | 0.110746 / 0.141683 (-0.030937) |
| write_flattened_sequence | 2.120290 / 1.452155 (0.668135) |
| write_nested_sequence | 2.222590 / 1.492716 (0.729874) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
| --- | --- |
| get_batch_of_1024_random_rows | 0.310647 / 0.018006 (0.292641) |
| get_batch_of_1024_rows | 0.499198 / 0.000490 (0.498708) |
| get_first_row | 0.040046 / 0.000200 (0.039846) |
| get_last_row | 0.000856 / 0.000054 (0.000802) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
| --- | --- |
| select | 0.028820 / 0.037411 (-0.008592) |
| shard | 0.116194 / 0.014526 (0.101668) |
| shuffle | 0.117521 / 0.176557 (-0.059035) |
| sort | 0.177797 / 0.737135 (-0.559338) |
| train_test_split | 0.115116 / 0.296338 (-0.181223) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
| --- | --- |
| read 5000 | 0.619095 / 0.215209 (0.403886) |
| read 50000 | 6.069413 / 2.077655 (3.991758) |
| read_batch 50000 10 | 2.295033 / 1.504120 (0.790913) |
| read_batch 50000 100 | 1.979819 / 1.541195 (0.438624) |
| read_batch 50000 1000 | 2.025178 / 1.468490 (0.556687) |
| read_formatted numpy 5000 | 0.775846 / 4.584777 (-3.808931) |
| read_formatted pandas 5000 | 6.321369 / 3.745712 (2.575657) |
| read_formatted tensorflow 5000 | 3.202163 / 5.269862 (-2.067699) |
| read_formatted torch 5000 | 1.539173 / 4.565676 (-3.026504) |
| read_formatted_batch numpy 5000 10 | 0.091141 / 0.424275 (-0.333134) |
| read_formatted_batch numpy 5000 1000 | 0.015016 / 0.007607 (0.007409) |
| shuffled read 5000 | 0.775332 / 0.226044 (0.549288) |
| shuffled read 50000 | 7.866400 / 2.268929 (5.597472) |
| shuffled read_batch 50000 10 | 3.090199 / 55.444624 (-52.354425) |
| shuffled read_batch 50000 100 | 2.344451 / 6.876477 (-4.532026) |
| shuffled read_batch 50000 1000 | 2.408549 / 2.142072 (0.266477) |
| shuffled read_formatted numpy 5000 | 0.952851 / 4.805227 (-3.852377) |
| shuffled read_formatted_batch numpy 5000 10 | 0.188744 / 6.500664 (-6.311920) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.074212 / 0.075469 (-0.001257) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
| --- | --- |
| filter | 1.993420 / 1.841788 (0.151632) |
| map fast-tokenizer batched | 16.292291 / 8.074308 (8.217983) |
| map identity | 44.430821 / 10.191392 (34.239429) |
| map identity batched | 1.089381 / 0.680424 (0.408957) |
| map no-op batched | 0.640221 / 0.534201 (0.106020) |
| map no-op batched numpy | 0.622160 / 0.579283 (0.042877) |
| map no-op batched pandas | 0.688170 / 0.434364 (0.253807) |
| map no-op batched pytorch | 0.399084 / 0.540337 (-0.141254) |
| map no-op batched tensorflow | 0.417387 / 1.386936 (-0.969549) |
PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
| --- | --- |
| read_batch_formatted_as_numpy after write_array2d | 0.011510 / 0.011353 (0.000158) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.005099 / 0.011008 (-0.005909) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.039140 / 0.038508 (0.000632) |
| read_batch_unformated after write_array2d | 0.038551 / 0.023109 (0.015442) |
| read_batch_unformated after write_flattened_sequence | 0.375323 / 0.275898 (0.099425) |
| read_batch_unformated after write_nested_sequence | 0.413009 / 0.323480 (0.089529) |
| read_col_formatted_as_numpy after write_array2d | 0.007600 / 0.007986 (-0.000386) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004204 / 0.004328 (-0.000125) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.007726 / 0.004250 (0.003475) |
| read_col_unformated after write_array2d | 0.042863 / 0.037052 (0.005810) |
| read_col_unformated after write_flattened_sequence | 0.401445 / 0.258489 (0.142956) |
| read_col_unformated after write_nested_sequence | 0.420656 / 0.293841 (0.126815) |
| read_formatted_as_numpy after write_array2d | 0.052176 / 0.128546 (-0.076371) |
| read_formatted_as_numpy after write_flattened_sequence | 0.014797 / 0.075646 (-0.060849) |
| read_formatted_as_numpy after write_nested_sequence | 0.323009 / 0.419271 (-0.096263) |
| read_unformated after write_array2d | 0.070745 / 0.043533 (0.027212) |
| read_unformated after write_flattened_sequence | 0.394936 / 0.255139 (0.139797) |
| read_unformated after write_nested_sequence | 0.435609 / 0.283200 (0.152409) |
| write_array2d | 0.103147 / 0.141683 (-0.038536) |
| write_flattened_sequence | 2.231597 / 1.452155 (0.779442) |
| write_nested_sequence | 2.271154 / 1.492716 (0.778438) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
| --- | --- |
| get_batch_of_1024_random_rows | 0.282262 / 0.018006 (0.264256) |
| get_batch_of_1024_rows | 0.490697 / 0.000490 (0.490207) |
| get_first_row | 0.016967 / 0.000200 (0.016767) |
| get_last_row | 0.000296 / 0.000054 (0.000242) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
| --- | --- |
| select | 0.027440 / 0.037411 (-0.009971) |
| shard | 0.121945 / 0.014526 (0.107419) |
| shuffle | 0.128258 / 0.176557 (-0.048298) |
| sort | 0.167555 / 0.737135 (-0.569581) |
| train_test_split | 0.127547 / 0.296338 (-0.168791) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
| --- | --- |
| read 5000 | 0.581800 / 0.215209 (0.366591) |
| read 50000 | 5.900716 / 2.077655 (3.823061) |
| read_batch 50000 10 | 2.215627 / 1.504120 (0.711507) |
| read_batch 50000 100 | 1.883143 / 1.541195 (0.341948) |
| read_batch 50000 1000 | 1.934968 / 1.468490 (0.466478) |
| read_formatted numpy 5000 | 0.729405 / 4.584777 (-3.855372) |
| read_formatted pandas 5000 | 6.400845 / 3.745712 (2.655133) |
| read_formatted tensorflow 5000 | 4.914529 / 5.269862 (-0.355332) |
| read_formatted torch 5000 | 1.598092 / 4.565676 (-2.967584) |
| read_formatted_batch numpy 5000 10 | 0.085621 / 0.424275 (-0.338654) |
| read_formatted_batch numpy 5000 1000 | 0.013807 / 0.007607 (0.006200) |
| shuffled read 5000 | 0.758021 / 0.226044 (0.531977) |
| shuffled read 50000 | 7.539040 / 2.268929 (5.270111) |
| shuffled read_batch 50000 10 | 2.915067 / 55.444624 (-52.529557) |
| shuffled read_batch 50000 100 | 2.186172 / 6.876477 (-4.690305) |
| shuffled read_batch 50000 1000 | 2.349870 / 2.142072 (0.207797) |
| shuffled read_formatted numpy 5000 | 0.953444 / 4.805227 (-3.851783) |
| shuffled read_formatted_batch numpy 5000 10 | 0.191821 / 6.500664 (-6.308843) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.073366 / 0.075469 (-0.002103) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
| --- | --- |
| filter | 2.051319 / 1.841788 (0.209531) |
| map fast-tokenizer batched | 16.289497 / 8.074308 (8.215189) |
| map identity | 43.079665 / 10.191392 (32.888273) |
| map identity batched | 1.085309 / 0.680424 (0.404885) |
| map no-op batched | 0.651766 / 0.534201 (0.117565) |
| map no-op batched numpy | 0.604061 / 0.579283 (0.024778) |
| map no-op batched pandas | 0.684142 / 0.434364 (0.249778) |
| map no-op batched pytorch | 0.397893 / 0.540337 (-0.142444) |
| map no-op batched tensorflow | 0.424764 / 1.386936 (-0.962172) |

