This package contains the files needed to create the probing tests.
Unzip the unimorph.zip file into the ../unzipped directory, and download the word frequency lists for the languages you are interested in. We use fastText vector files trained on Wikipedia, because the words in them are ordered by frequency. Save the vector files under ../embeddings.
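The setup above can be sketched in Python as follows. This is a minimal sketch, not part of the package: the helper names `extract_archive` and `read_frequent_words` are hypothetical, and it assumes the vector file is in the fastText text (.vec) format, where the first line is a header with the vocabulary size and dimension and every following line starts with a word.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir):
    """Extract a zip archive (e.g. unimorph.zip) into target_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)


def read_frequent_words(vec_path, limit):
    """Return the first `limit` words of a fastText .vec file.

    Assumption: the file lists words from most to least frequent,
    so the first words are the most frequent ones.
    """
    words = []
    with open(vec_path, encoding="utf-8", errors="ignore") as handle:
        next(handle, None)  # skip the "<vocab size> <dimension>" header
        for line in handle:
            if len(words) >= limit:
                break
            word = line.split(" ", 1)[0]
            if word:
                words.append(word)
    return words
```

For example, `extract_archive("unimorph.zip", "../unzipped")` would unpack the archive, and `read_frequent_words(path, 10000)` would return the 10,000 most frequent words of a vector file saved under ../embeddings.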
We already created the files for you under ../intrinsic/data. In case you want to generate them with different settings or for different languages, use the scripts listed here.
Available Functions:
- prepareSingleTests.py - Prepares the single feature tests described in the paper
  - args.feat: If 1, prepares the UniMorph-related single probing tasks, e.g. Case, Gender, Tense
  - args.common: If 1, creates the common tests POS, CharacterBin and TagCount
  - args.pseudo: If 1, prepares the Wuggy tests. Reads the pseudowords from the generated_wuggy_files folder.
  - args.nonlabelratio=0.3: Specifies the ratio of tokens with the 'None' label
  - args.savedir: The folder path of the probing tasks
- preparePairedTests.py - Prepares the OddFeat and SameFeat tests
  - args.keepratio=0.2: The desired rare word ratio
  - args.savedir: The folder path of the probing tasks
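As an illustration of how the two scripts might be invoked, the sketch below builds the command lines in Python. It rests on an assumption: the README only names the parsed attributes (args.feat, args.keepratio, and so on), so the `--feat`-style flag spelling is a guess, and `build_command` is a hypothetical helper. Check each script's argument parser before relying on it.

```python
import sys


def build_command(script, **options):
    """Return an argv list such as [python, script, '--feat', '1'].

    Assumption: each option is exposed as a '--<name> <value>' flag.
    """
    command = [sys.executable, script]
    for name, value in options.items():
        command += [f"--{name}", str(value)]
    return command


# Single feature tests: UniMorph tasks plus the common tests.
single = build_command(
    "prepareSingleTests.py",
    feat=1,
    common=1,
    nonlabelratio=0.3,
    savedir="../intrinsic/data",
)

# Paired tests (OddFeat and SameFeat).
paired = build_command(
    "preparePairedTests.py",
    keepratio=0.2,
    savedir="../intrinsic/data",
)
```

Either list can be passed to `subprocess.run(single, check=True)` from the directory that contains the scripts.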
You need to download the SIGMORPHON 2019 dataset and extract it to ../unzipped/sigmorphon19. You also need the Universal Dependencies treebanks v2.4. Please download them and extract them to ../unzipped/ud-treebanks-v2.4.
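Before generating tests, it can help to confirm that the directories mentioned in this README are in place. The following is a small sketch of such a check; `missing_directories` is a hypothetical helper and not part of the package.

```python
from pathlib import Path

# Directories named in this README, relative to the parent of the script folder.
EXPECTED_DIRS = [
    "unzipped",
    "embeddings",
    "unzipped/sigmorphon19",
    "unzipped/ud-treebanks-v2.4",
]


def missing_directories(base_dir):
    """Return the expected directories that do not exist under base_dir."""
    base = Path(base_dir)
    return [name for name in EXPECTED_DIRS if not (base / name).is_dir()]
```

Calling `missing_directories("..")` from the script folder returns an empty list when everything is in place.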
- prepareContextualTests.py - Prepares all tests for the contextual probing tasks
  - args.feat: The tests to generate (separated by commas). All tests are generated if this argument is not given.
  - args.zscore: Sentences whose length does not match the z-score are not used for the probing tasks
  - args.lang: The languages (separated by commas) for which probing tasks are generated. Tests for all languages are generated if this argument is not given.
  - args.size: The number of instances for each probing task
  - args.savedir: The directory in which the probing tasks are saved
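The README does not spell out how args.zscore is applied. One plausible reading, sketched below as an assumption rather than the script's actual logic, is that the z-score of each sentence length is computed over the corpus and sentences whose absolute z-score exceeds the threshold are dropped. The function name `filter_by_length_zscore` is hypothetical.

```python
from statistics import mean, pstdev


def filter_by_length_zscore(sentences, max_abs_z):
    """Keep sentences whose length z-score lies within [-max_abs_z, max_abs_z].

    `sentences` is a list of token lists. Assumption: the z-score is taken
    over the lengths of all sentences passed in.
    """
    lengths = [len(sentence) for sentence in sentences]
    if not lengths:
        return []
    mu = mean(lengths)
    sigma = pstdev(lengths)
    if sigma == 0:
        # All sentences have the same length, so none is an outlier.
        return list(sentences)
    return [
        sentence
        for sentence, length in zip(sentences, lengths)
        if abs((length - mu) / sigma) <= max_abs_z
    ]
```

With this reading, a very long sentence among otherwise short ones is excluded, which keeps the sentence lengths in a probing task comparable.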