Syntactic dependency parser for French with spaCy
This repository contains scripts that fetch and prepare data for training a syntactic dependency parser for French with spaCy, along with a configuration file and a script to train it. The trained model itself is available under Releases.
The training data is an aggregation of three UD (Universal Dependencies) datasets, with some minor changes applied to them.
In the datasets I used, the contraction du is split into its logical components de and le: a text like on parle du ciel becomes on parle de le ciel in the .conllu
files. But in the texts I have to analyze, du isn't split at all, so I need to un-split it. Thus the following:
11-12 du ... _ _ _ _
11 de ... 19 case _ _
12 le ... 11 det _ _
is transformed into:
11 du ... 19 case:det _ _
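The un-splitting can be sketched roughly as below. This is a minimal illustration of the idea, not the repository's actual script: it assumes a multi-word token range covers exactly two tokens and it does not renumber subsequent token IDs or remap heads that pointed at the removed ID.

```python
def unsplit_mwt(conllu_lines):
    """Merge a two-token multi-word range (e.g. '11-12  du') back into a
    single token, joining the deprels of its parts with ':'.

    Sketch only: assumes 10-column CoNLL-U lines, ranges of exactly two
    tokens, and skips the ID renumbering a real script would need."""
    out = []
    i = 0
    while i < len(conllu_lines):
        cols = conllu_lines[i].split('\t')
        if '-' in cols[0]:  # range line like '11-12  du'
            first = conllu_lines[i + 1].split('\t')
            second = conllu_lines[i + 2].split('\t')
            merged = first[:]            # keep the first part's columns
            merged[1] = cols[1]          # surface form, e.g. 'du'
            merged[7] = first[7] + ':' + second[7]  # 'case' + 'det' -> 'case:det'
            out.append('\t'.join(merged))
            i += 3                       # skip the two merged parts
        else:
            out.append(conllu_lines[i])
            i += 1
    return out
```

On the example above, the three lines for 11-12 du / 11 de / 12 le collapse into one line whose deprel is case:det and whose head is that of de.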
On top of that, some labels are replaced by others, and sentences containing certain labels (such as dep,
which indicates that the parsing failed) are removed. For the list of replaced and removed labels, refer to the file lookup_labels.txt.
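The label cleanup amounts to a filter plus a lookup, as in the sketch below. The REPLACE mapping and DROP set here are placeholders for illustration, not the actual contents of lookup_labels.txt.

```python
# Placeholder tables -- the real mappings live in lookup_labels.txt.
REPLACE = {'obl:arg': 'obl'}   # hypothetical example of a replaced label
DROP = {'dep'}                 # sentences containing these labels are removed

def clean_sentences(sentences):
    """Filter and relabel sentences.

    `sentences` is a list of sentences, each a list of (form, deprel)
    pairs. A sentence containing any label in DROP is removed entirely;
    in the remaining sentences, labels found in REPLACE are rewritten."""
    kept = []
    for sent in sentences:
        if any(deprel in DROP for _, deprel in sent):
            continue  # e.g. 'dep' means the original parse failed
        kept.append([(form, REPLACE.get(deprel, deprel))
                     for form, deprel in sent])
    return kept
```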
The parser is not a full pipeline: you have to source it from another pipeline as a component:
import spacy
# load your main pipeline
nlp = spacy.load('fr_core_news_sm', exclude=['parser'])
# load the model containing the parser
nlp_deps = spacy.load('./model', exclude=['tokenizer'])
# put the parser in the main pipeline
nlp.add_pipe('parser', source=nlp_deps)