This repository has been archived by the owner on Feb 12, 2022. It is now read-only.

Low number of unique words predicted #21

Open
mocialov opened this issue Feb 10, 2018 · 1 comment

Comments

@mocialov

I would like to perform a sanity check by passing some input to the model and reading the output text.

Following the PyTorch tutorial on language modelling (https://github.com/pytorch/examples/blob/master/word_language_model/generate.py), I have edited the evaluate function:

def evaluate(data_source, batch_size=10):
    # Turn on evaluation mode which disables dropout.
    if args.model == 'QRNN': model.reset()
    model.eval()
    total_loss = 0
    ntokens = len(corpus.dictionary)
    hidden = model.init_hidden(batch_size)
    for i in range(0, data_source.size(0) - 1, args.bptt):
        data, targets = get_batch(data_source, i, args, evaluation=True)

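        # Decode every input index back to its word and print the batch.
        print("inputs")
        inp = data.cpu().data.numpy()
        for input_ in inp:
            print([created_inverse_tokenizer_during_training[idx] for idx in input_])

        output, hidden = model(data, hidden)

        # Scale by the temperature and exponentiate to get unnormalized sampling weights.
        word_weights = output.squeeze().data.div(args.temperature).exp().cpu()
        # Draw 10 indices per row (torch.multinomial samples without replacement by default).
        word_idx = torch.multinomial(word_weights, 10)

        print("outputs")
        for word_ in word_idx:
            for item_ in word_:
                print("next word", created_inverse_tokenizer_during_training[item_])
            print("")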

        output_flat = output.view(-1, ntokens)
        total_loss += len(data) * criterion(output_flat, targets).data
        hidden = repackage_hidden(hidden)
    return total_loss[0] / len(data_source)

Here, created_inverse_tokenizer_during_training is the idx2word list from the Dictionary class.
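
For reference, the sampling step in the function above (divide the scores by a temperature, exponentiate, then draw from the resulting weights) can be sketched in plain Python for a single row of scores. This is only an illustration under assumptions: sample_next_word, the four-word vocabulary, and the scores are hypothetical, and the sketch expects exactly one score per vocabulary entry, which may not match the shape the model actually returns.

```python
import math
import random


def sample_next_word(logits, idx2word, temperature=1.0, rng=random):
    """Sample one word from a single row of per-vocabulary scores.

    Hypothetical helper for illustration; not part of the repository.
    """
    if len(logits) != len(idx2word):
        raise ValueError("expected exactly one score per vocabulary entry")
    # Divide by the temperature, then exponentiate. Subtracting the maximum
    # first avoids overflow and does not change the relative weights.
    scaled = [score / temperature for score in logits]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # random.choices normalizes the weights internally.
    index = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return idx2word[index]


# Hypothetical vocabulary and scores, for illustration only.
vocab = ["the", "market", "fell", "<eos>"]
print(sample_next_word([2.0, 0.5, 0.1, -1.0], vocab, temperature=0.8))
```

The length check is the useful part for a sanity check: if the row being sampled does not have one entry per word in the dictionary, the sampled indices do not refer to vocabulary positions.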

I am testing on the PTB dataset and get the following, with a perplexity of approximately 60:

inputs:
[made, value, $, their, intends, N, also, south, , or]
[much, criteria, N, office, to, return, closed, as, one, $]
[difference, devised, billion, visits, restrict, on, sharply, it, analyst, N]
[in, by, , as, the, assets, lower, became, peter, a]
[liquidity, benjamin, a, , rtc, for, across, more, , share]
[in, graham, , breaks, to, security, europe, clear, of, in]
[the, an, , , treasury, pacific, particularly, that, , the]
[pit, analyst, by, but, borrowings, and, in, a, &, fiscal]
[, and, an, massage, only, an, frankfurt, repeat, co., year]
[it, author, , no, unless, N, although, of, new, just]
["s", in, not, matter, the, N, london, the, york, ended]
[too, the, , how, agency, return, and, october, said, up]
[soon, 1930s, though, , receives, on, a, N, the, from]
[to, and, , is, specific, equity, few, crash, gold, $]
[tell, , , still, congressional, , other, was, market, N]
[but, who, english, associated, authorization, the, markets, "nt", already, million]
[people, is, butler, in, , loan, recovered, at, had, in]
[do, widely, in, many, such, growth, some, hand, some, fiscal]
["nt", considered, his, minds, agency, offset, ground, , good, N]
[seem, to, , with, , continuing, after, professionals, , and]
[to, be, proceeds, , borrowing, real-estate, stocks, dominated, technical, $]
[be, the, as, fronts, is, loan, began, municipal, factors, N]
[unhappy, father, if, for, unauthorized, losses, to, trading, that, million]
[with, of, the, , and, in, rebound, throughout, would, in]
[it, modern, realistic, and, expensive, the, in, the, have, N]

outputs:
[berlitz, hydro-quebec, banknote, centrust, gitano, cluett, guterman, aer, fromstein, calloway]
[berlitz, centrust, cluett, fromstein, aer, gitano, hydro-quebec, guterman, calloway, banknote]
[banknote, hydro-quebec, calloway, fromstein, berlitz, gitano, cluett, aer, guterman, centrust]
[calloway, berlitz, cluett, centrust, aer, gitano, hydro-quebec, banknote, guterman, fromstein]
[fromstein, hydro-quebec, aer, banknote, gitano, berlitz, calloway, cluett, centrust, guterman]
[calloway, hydro-quebec, guterman, fromstein, berlitz, banknote, cluett, centrust, gitano, aer]
[gitano, fromstein, hydro-quebec, cluett, calloway, centrust, berlitz, guterman, aer, banknote]
[berlitz, gitano, banknote, cluett, calloway, aer, centrust, fromstein, hydro-quebec, guterman]
[calloway, gitano, guterman, berlitz, centrust, hydro-quebec, cluett, aer, fromstein, banknote]
[hydro-quebec, berlitz, fromstein, gitano, cluett, calloway, aer, centrust, guterman, banknote]
[aer, cluett, fromstein, berlitz, guterman, calloway, hydro-quebec, centrust, banknote, gitano]
[cluett, calloway, centrust, fromstein, banknote, gitano, guterman, hydro-quebec, aer, berlitz]
[hydro-quebec, fromstein, calloway, aer, banknote, berlitz, cluett, gitano, centrust, guterman]
[banknote, gitano, aer, centrust, cluett, fromstein, calloway, guterman, hydro-quebec, berlitz]
[calloway, aer, gitano, berlitz, fromstein, cluett, guterman, banknote, hydro-quebec, centrust]
[banknote, cluett, fromstein, berlitz, gitano, aer, centrust, calloway, hydro-quebec, guterman]
[cluett, fromstein, aer, calloway, guterman, banknote, berlitz, gitano, centrust, hydro-quebec]
[aer, guterman, berlitz, gitano, centrust, cluett, calloway, hydro-quebec, fromstein, banknote]
[centrust, fromstein, cluett, berlitz, aer, banknote, guterman, gitano, calloway, hydro-quebec]
[guterman, banknote, fromstein, cluett, gitano, calloway, aer, centrust, berlitz, hydro-quebec]
[calloway, berlitz, aer, banknote, hydro-quebec, fromstein, cluett, guterman, gitano, centrust]
[banknote, hydro-quebec, berlitz, fromstein, guterman, calloway, cluett, centrust, gitano, aer]
[centrust, aer, fromstein, cluett, hydro-quebec, calloway, gitano, berlitz, guterman, banknote]
[fromstein, centrust, aer, banknote, berlitz, guterman, gitano, hydro-quebec, calloway, cluett]
[cluett, banknote, hydro-quebec, gitano, berlitz, fromstein, calloway, guterman, centrust, aer]

As you can see, the output contains very few unique words: the same 10 words appear in every row, only in a different order. Why is that? Am I doing something wrong?
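
One check that may help narrow this down, offered as an assumption rather than a confirmed diagnosis: torch.multinomial samples without replacement by default, so requesting 10 samples from a row that has only 10 entries returns all 10 indices in some order. If the last dimension of word_weights were 10 instead of len(corpus.dictionary), and the ten words above sat at indices 0 to 9 of idx2word, the result would be rows that are permutations of one another, as in the output shown. Printing word_weights.size() would confirm or rule this out. The plain-Python sketch below mimics that sampling behaviour; sample_without_replacement and the weights are hypothetical, not PyTorch code.

```python
import random


def sample_without_replacement(weights, num_samples, rng=random):
    """Draw distinct indices with probability proportional to the weights.

    Hypothetical helper that imitates sampling without replacement.
    """
    if num_samples > sum(1 for w in weights if w > 0):
        raise ValueError("not enough non-zero categories to sample from")
    remaining = list(range(len(weights)))
    picked = []
    for _ in range(num_samples):
        live = [weights[i] for i in remaining]
        choice = rng.choices(remaining, weights=live, k=1)[0]
        picked.append(choice)
        remaining.remove(choice)  # an index can be drawn only once
    return picked


# Hypothetical row with exactly 10 weights, sampled 10 times per draw.
row = [0.3, 2.0, 0.1, 5.0, 1.2, 0.7, 0.9, 4.0, 0.2, 1.5]
draws = [sample_without_replacement(row, 10) for _ in range(5)]
for draw in draws:
    print(draw)

# Every draw contains each index 0..9 exactly once, whatever the weights are.
print(all(sorted(draw) == list(range(10)) for draw in draws))
```

With as many samples as categories, the weights only influence the order of the indices, never which indices appear.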

@andrewPoulton

andrewPoulton commented Aug 28, 2018

Probably related to this
