- Add in private repo: tensorflow-datasets
- Create a
tensorflow_datasets/text/<your_dataset_name>.py
file. Look at how other datasets were added, e.g. tensorflow_datasets/text/wikitext.py
- Create an empty checksum file:
tensorflow_datasets/url_checksums/<your_dataset_name>.txt
- Add an import statement in
tensorflow_datasets/text/__init__.py
. Looking at the other import statements, it is easy to see what should be done here.
- Run
python -m tensorflow_datasets.scripts.download_and_prepare --register_checksums --datasets=<your_dataset_name>
- Adapt your
tensorflow_datasets/text/<your_dataset_name>.py
file and rerun
python -m tensorflow_datasets.scripts.download_and_prepare --register_checksums --datasets=<your_dataset_name>
until no errors appear anymore, which means that your dataset can be loaded successfully!
- Add a test file:
tensorflow_datasets/text/<your_dataset_name>_test.py
and create dummy data. Look at how it's done for other datasets.
- The dummy data should have the same folder structure as the one that is created when downloading the dataset from the official website.
- Add dummy data consisting of a dummy train and/or test and/or validation data file in the same format as the real one and, for each dummy data file, a .json file containing exactly the output that you expect to get.
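
The file-creation and dummy-data steps above can be sketched in Python. This is only a minimal sketch: the helper names `scaffold_dataset` and `write_dummy_pair` are hypothetical, and the `<split>.txt` / `<split>.json` file names are an assumption. The real dummy files should mirror the names and format of the actual dataset download, and the location of the dummy data folder is left as a parameter because it depends on how the other datasets in the repo do it.

```python
import json
from pathlib import Path


def scaffold_dataset(repo_root, name):
    """Create the empty files named in the steps above, if they are missing."""
    root = Path(repo_root) / "tensorflow_datasets"
    paths = [
        root / "text" / f"{name}.py",            # dataset definition
        root / "url_checksums" / f"{name}.txt",  # empty checksum file
        root / "text" / f"{name}_test.py",       # test file
    ]
    for path in paths:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch(exist_ok=True)  # leaves existing content untouched
    return paths


def write_dummy_pair(dummy_dir, split, raw_text, expected_examples):
    """Write one dummy data file plus the .json file with its expected output.

    Assumption: plain-text data files named <split>.txt. Replace this with
    the file names and format of the real dataset.
    """
    dummy_dir = Path(dummy_dir)
    dummy_dir.mkdir(parents=True, exist_ok=True)
    data_file = dummy_dir / f"{split}.txt"
    json_file = dummy_dir / f"{split}.json"
    data_file.write_text(raw_text, encoding="utf-8")
    json_file.write_text(json.dumps(expected_examples, indent=2), encoding="utf-8")
    return data_file, json_file
```

The import statement in tensorflow_datasets/text/__init__.py is deliberately not generated here, since its exact form should be copied from the neighbouring entries.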
- Add to nlp repo
- Execute the conversion script in the nlp repo as follows:
python nlp-cli convert --tfds_directory <path/to/tensorflow_datasets/text> --tfds_rel_filename <your_dataset_name>.py --nlp_directory <path/to/nlp/datasets/nlp>
- Now you should have a converted file in the folder
nlp/datasets/nlp/<your_dataset_name>/<your_dataset_name>.py
.
- Upload the folder to AWS. To get access to AWS, first ask Julien. Make sure you don't accidentally delete anything there. Use the command
aws s3 cp <path/to/nlp/datasets/nlp>/<your_dataset_name>/<your_dataset_name>.py s3://datasets.huggingface.co/nlp/<your_dataset_name>/<your_dataset_name>.py
- Make sure that your object has public read access on AWS. Otherwise it cannot be downloaded.
- Awesome, your dataset can now be used with nlp. You should check that everything works as expected by opening a Python shell and loading your dataset:
import nlp
dataset = nlp.load("your_dataset_name")
...
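
Since the upload step asks for care, it can help to build the `aws s3 cp` command from the dataset name and review it before running it. The sketch below does only that; `build_upload_command` is a hypothetical helper, not part of nlp or the AWS CLI. The optional `--acl public-read` flag is an assumption about how public read access could be granted: it only works if your credentials and the bucket settings allow setting ACLs.

```python
import shlex


def build_upload_command(nlp_directory, name, public_read=False):
    """Return the `aws s3 cp` command for one converted dataset script."""
    local = f"{nlp_directory}/{name}/{name}.py"
    remote = f"s3://datasets.huggingface.co/nlp/{name}/{name}.py"
    command = ["aws", "s3", "cp", local, remote]
    if public_read:
        # Assumption: the bucket permits setting an ACL at upload time.
        command += ["--acl", "public-read"]
    return command


if __name__ == "__main__":
    # Print the command for review instead of executing it.
    print(shlex.join(build_upload_command("path/to/nlp/datasets/nlp", "my_dataset")))
```

Once the printed command looks right, it can be pasted into a shell or passed to `subprocess.run(command, check=True)`.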