
Why do I get only numbers as fill-mask predictions after pretraining on my data? #1727

Closed
slusarczyk41 opened this issue Feb 20, 2020 · 1 comment


slusarczyk41 commented Feb 20, 2020

❓ Questions and Help

After further pretraining an existing model on my own data, I get only numbers as fill-mask predictions. Why?

Code

All the steps I took are in this notebook:

So, in a nutshell:

  • I followed the pretraining README with my own data (I replaced the wiki dataset with my own; everything else is copy/pasted)
  • I started from a model pretrained by other people (via the --restore-file option)
  • I used the gpt2_bpe encoder, vocab and dict downloaded from README.pretraining
  • After training for a bit I tested the model, and the output for mask filling is always some number (I guess it does not know how to decode them back into words; a quick sanity check for this is sketched after the example below)

I cannot find a way to solve this, since I did everything exactly as the README describes, apart from swapping in my own data at the beginning.

Example:

In (note the double space before <mask>):
roberta.fill_mask('Bolesław chrobry urodził się w  <mask>.', topk=10)

Out:
[('Bolesław chrobry urodził się w 35735.', 0.00015262558008544147, '35735'),
('Bolesław chrobry urodził się w 1352.', 0.00015025328320916742, '1352'),
('Bolesław chrobry urodził się w 48580.', 0.00014154364180285484, '48580'),
('Bolesław chrobry urodził się w 2960.', 0.00013927527470514178, '2960'),
('Bolesław chrobry urodził się w 44026.', 0.0001296651316806674, '44026'),
('Bolesław chrobry urodził się w 49958.', 0.0001274164387723431, '49958'),
('Bolesław chrobry urodził się w 2556.', 0.00012739280646201223, '2556'),
('Bolesław chrobry urodził się w 34301.', 0.000126967832329683, '34301'),
('Bolesław chrobry urodził się w 22433.', 0.0001259078417206183, '22433'),
('Bolesław chrobry urodził się w 38204.', 0.0001207769091706723, '38204')]
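
For what it's worth, a check that might narrow this down (a hedged sketch, not taken from the notebook; roberta here is the hub interface loaded there): with gpt2_bpe-encoded data the dictionary symbols are the BPE ids stored as strings, and fill_mask relies on the BPE configured at load time to turn them back into words, so if that BPE cannot decode them you get the raw ids, exactly like the output above.

# Hedged sanity check: the dictionary symbols should be GPT-2 BPE ids stored as strings.
print(roberta.task.source_dictionary.symbols[:20])

# If the BPE configured at load time matches the one used to encode the training data,
# this prints a piece of text instead of echoing the id back.
print(roberta.bpe.decode('35735'))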

slusarczyk41 (Author) commented

I changed the model loading from

import os

from fairseq import hub_utils
from fairseq.models.roberta import RobertaHubInterface, RobertaModel

# Load the checkpoint through hub_utils, decoding predictions with sentencepiece BPE.
model_path = "roberta/"
loaded = hub_utils.from_pretrained(
    model_name_or_path=model_path,
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path=model_path,
    bpe="sentencepiece",
    sentencepiece_vocab=os.path.join(model_path, "sentencepiece.model"),
    load_checkpoint_heads=True,
    archive_map=RobertaModel.hub_models(),
    cpu=True
)
roberta = RobertaHubInterface(loaded['args'], loaded['task'], loaded['models'][0])

to

roberta_agora = RobertaModel.from_pretrained('checkpoints', 'checkpoint_best.pt', 'data-bin/agora')

And it is fine now.
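
For what it's worth, my reading of why this fixes it (not stated explicitly above): RobertaModel.from_pretrained defaults to the GPT-2 BPE and picks up the dictionary from data-bin/agora, so the predicted ids get decoded back into words, whereas the earlier call forced bpe="sentencepiece" on data that was encoded with gpt2_bpe. Spelled out with its defaults, the working call looks roughly like this (a sketch; only the keyword names and the eval/fill_mask lines are added relative to the one-liner above):

from fairseq.models.roberta import RobertaModel

# Same call as above with the implicit defaults written out: the dictionary comes
# from the binarized data directory and predictions are decoded with the GPT-2 BPE.
roberta_agora = RobertaModel.from_pretrained(
    'checkpoints',
    checkpoint_file='checkpoint_best.pt',
    data_name_or_path='data-bin/agora',
    bpe='gpt2',
)
roberta_agora.eval()

# Note the double space before <mask>, as in the original example.
print(roberta_agora.fill_mask('Bolesław chrobry urodził się w  <mask>.', topk=10))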

facebook-github-bot pushed a commit that referenced this issue Mar 20, 2021
Summary: Pull Request resolved: fairinternal/fairseq-py#1727

Reviewed By: myleott

Differential Revision: D27213955

Pulled By: sshleifer

fbshipit-source-id: be84e7f7c1c55c407ee7445fad9b3026a79763fb
sshleifer added a commit that referenced this issue Apr 7, 2021