thjbdvlt/spacy-french-parser

syntactic dependency parser for french with spacy.

this repository contains scripts that fetch and prepare data to train a syntactic dependency parser for the french language with spacy, along with a configuration file and a script to train it. the model itself is available under releases.

the training data is an aggregation of three UD datasets, with some minor changes applied to them.

in the datasets i used, the word du is split into its logical components de and le. a text like on parle du ciel becomes on parle de le ciel in the .conllu files. but in the texts i have to analyze, du isn't split at all, so i need to unsplit it. thus the following:

11-12	du	...	_	_	_	_
11	de	...	19	case	_	_
12	le	...	11	det	_	_

is transformed into:

11	du	...	19	case:det	_	_
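
the merge shown above can be sketched in a few lines of python. this is a minimal illustration, not the script from this repository: the function name merge_multiword_tokens is hypothetical, and it assumes that the merged row keeps the head (and other columns) of the first component and joins the components' labels with a colon. it does not renumber the ids and heads that follow the merged token, which a complete script would probably have to handle.

```python
# column indices in a CoNLL-U row (10 tab-separated fields)
ID, FORM, HEAD, DEPREL = 0, 1, 6, 7


def merge_multiword_tokens(rows):
    """Collapse each range row (e.g. '11-12 du') and its components into one row.

    Assumptions of this sketch: the merged row is a copy of the first
    component, with the surface form of the range row and the components'
    labels joined by ':'. Later ids and heads are NOT renumbered.
    """
    merged = []
    i = 0
    while i < len(rows):
        row = rows[i]
        if "-" in row[ID]:
            start, end = (int(n) for n in row[ID].split("-"))
            parts = rows[i + 1 : i + 1 + (end - start + 1)]
            new_row = list(parts[0])  # keeps id, lemma, head, ... of 'de'
            new_row[FORM] = row[FORM]  # surface form: 'du'
            new_row[DEPREL] = ":".join(p[DEPREL] for p in parts)
            merged.append(new_row)
            i += 1 + len(parts)  # skip the range row and its components
        else:
            merged.append(row)
            i += 1
    return merged


# example rows; values other than those shown in the README are made up
rows = [
    ["11-12", "du", "_", "_", "_", "_", "_", "_", "_", "_"],
    ["11", "de", "de", "ADP", "_", "_", "19", "case", "_", "_"],
    ["12", "le", "le", "DET", "_", "_", "11", "det", "_", "_"],
]
for row in merge_multiword_tokens(rows):
    print("\t".join(row))
```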

in addition, some labels are replaced by others, and sentences containing certain labels (such as dep, which indicates that the parsing failed) are removed. for the list of replaced or removed labels, refer to the file lookup_labels.txt.
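
the relabelling and filtering step could look roughly like the sketch below. it is only an illustration: the format of lookup_labels.txt is not described here, so the replacements and the removed labels are passed in as a plain dict and set. the function name relabel_and_filter and the obl:arg to obl mapping are hypothetical. only dep is taken from the text above. the sketch also assumes that removal is decided on the original labels, before any replacement.

```python
DEPREL = 7  # index of the dependency label in a CoNLL-U row


def relabel_and_filter(sentences, replacements, removed):
    """Drop sentences containing a removed label, then replace labels.

    sentences: list of sentences, each a list of CoNLL-U rows (lists of str)
    replacements: dict mapping an old label to its new label (hypothetical)
    removed: set of labels that cause a whole sentence to be discarded
    """
    kept = []
    for sentence in sentences:
        if any(row[DEPREL] in removed for row in sentence):
            continue  # e.g. a 'dep' label: discard the whole sentence
        new_sentence = []
        for row in sentence:
            new_row = list(row)
            new_row[DEPREL] = replacements.get(row[DEPREL], row[DEPREL])
            new_sentence.append(new_row)
        kept.append(new_sentence)
    return kept


# made-up example data
good = [
    ["1", "on", "on", "PRON", "_", "_", "2", "nsubj", "_", "_"],
    ["2", "parle", "parler", "VERB", "_", "_", "0", "root", "_", "_"],
    ["3", "du", "de", "ADP", "_", "_", "4", "case:det", "_", "_"],
    ["4", "ciel", "ciel", "NOUN", "_", "_", "2", "obl:arg", "_", "_"],
]
bad = [
    ["1", "euh", "euh", "INTJ", "_", "_", "0", "root", "_", "_"],
    ["2", "bon", "bon", "ADJ", "_", "_", "1", "dep", "_", "_"],
]
result = relabel_and_filter([good, bad], {"obl:arg": "obl"}, {"dep"})
print(len(result), [row[DEPREL] for row in result[0]])
```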

usage

the parser is not a full pipeline. you have to add it to another pipeline as a sourced component:

import spacy

# load your main pipeline
nlp = spacy.load('fr_core_news_sm', exclude=['parser'])

# load the model containing the parser
nlp_deps = spacy.load('./model', exclude=['tokenizer'])

# put the parser in the main pipeline
nlp.add_pipe('parser', source=nlp_deps)