
Why do I get only numbers as fill-mask predictions after pretraining on my data? #1727

Closed
slusarczyk41 opened this issue Feb 20, 2020 · 1 comment


slusarczyk41 commented Feb 20, 2020

❓ Questions and Help

After further pretraining an existing model on my own data, I get only numbers as fill-mask predictions. Why?

Code

All the steps I took are in this notebook:

So, in a nutshell:

  • I followed the pretraining README with my own data (I replaced the wiki dataset with my own; everything else is copy/pasted)
  • I started from a model pretrained by other people (via the --restore-file option)
  • I used the gpt2_bpe encoder, vocab and dict downloaded from README.pretraining
  • After training for a bit I tested the model, and the output for mask filling is always some number (I guess it does not know how to decode them back into words; a quick sanity check for this is sketched after the example below)

I cannot find a way to solve this, since I did everything exactly as the README describes, apart from swapping in my own data at the beginning.

Example:

In (note the double space before <mask>):
roberta.fill_mask('Bolesław chrobry urodził się w  <mask>.', topk=10)

Out:
[('Bolesław chrobry urodził się w 35735.', 0.00015262558008544147, '35735'),
('Bolesław chrobry urodził się w 1352.', 0.00015025328320916742, '1352'),
('Bolesław chrobry urodził się w 48580.', 0.00014154364180285484, '48580'),
('Bolesław chrobry urodził się w 2960.', 0.00013927527470514178, '2960'),
('Bolesław chrobry urodził się w 44026.', 0.0001296651316806674, '44026'),
('Bolesław chrobry urodził się w 49958.', 0.0001274164387723431, '49958'),
('Bolesław chrobry urodził się w 2556.', 0.00012739280646201223, '2556'),
('Bolesław chrobry urodził się w 34301.', 0.000126967832329683, '34301'),
('Bolesław chrobry urodził się w 22433.', 0.0001259078417206183, '22433'),
('Bolesław chrobry urodził się w 38204.', 0.0001207769091706723, '38204')]
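
For what it's worth, a check that might narrow this down (a hedged sketch, not taken from the notebook; roberta here is the hub interface loaded there): with gpt2_bpe-encoded data the dictionary symbols are the BPE ids stored as strings, and fill_mask relies on the BPE configured at load time to turn them back into words, so if that BPE cannot decode them you get the raw ids, exactly like the output above.

# Hedged sanity check: the dictionary symbols should be GPT-2 BPE ids stored as strings.
print(roberta.task.source_dictionary.symbols[:20])

# If the BPE configured at load time matches the one used to encode the training data,
# this prints a piece of text instead of echoing the id back.
print(roberta.bpe.decode('35735'))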

slusarczyk41 (Author) commented

I changed the model loading from

import os

from fairseq import hub_utils
from fairseq.models.roberta import RobertaHubInterface, RobertaModel

# Load the checkpoint through hub_utils, decoding predictions with sentencepiece BPE.
model_path = "roberta/"
loaded = hub_utils.from_pretrained(
    model_name_or_path=model_path,
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path=model_path,
    bpe="sentencepiece",
    sentencepiece_vocab=os.path.join(model_path, "sentencepiece.model"),
    load_checkpoint_heads=True,
    archive_map=RobertaModel.hub_models(),
    cpu=True
)
roberta = RobertaHubInterface(loaded['args'], loaded['task'], loaded['models'][0])

to

roberta_agora = RobertaModel.from_pretrained('checkpoints', 'checkpoint_best.pt', 'data-bin/agora')

And it is fine now.
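
For what it's worth, my reading of why this fixes it (not stated explicitly above): RobertaModel.from_pretrained defaults to the GPT-2 BPE and picks up the dictionary from data-bin/agora, so the predicted ids get decoded back into words, whereas the earlier call forced bpe="sentencepiece" on data that was encoded with gpt2_bpe. Spelled out with its defaults, the working call looks roughly like this (a sketch; only the keyword names and the eval/fill_mask lines are added relative to the one-liner above):

from fairseq.models.roberta import RobertaModel

# Same call as above with the implicit defaults written out: the dictionary comes
# from the binarized data directory and predictions are decoded with the GPT-2 BPE.
roberta_agora = RobertaModel.from_pretrained(
    'checkpoints',
    checkpoint_file='checkpoint_best.pt',
    data_name_or_path='data-bin/agora',
    bpe='gpt2',
)
roberta_agora.eval()

# Note the double space before <mask>, as in the original example.
print(roberta_agora.fill_mask('Bolesław chrobry urodził się w  <mask>.', topk=10))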

facebook-github-bot pushed a commit that referenced this issue Mar 20, 2021
Summary: Pull Request resolved: fairinternal/fairseq-py#1727

Reviewed By: myleott

Differential Revision: D27213955

Pulled By: sshleifer

fbshipit-source-id: be84e7f7c1c55c407ee7445fad9b3026a79763fb
sshleifer added a commit that referenced this issue Apr 7, 2021