Using Google 1-billion benchmark data on PyTorch #644
Comments
You should use the LanguageModelingDataset for your own corpus instead of the built-in WikiText2 class.
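A minimal sketch of that suggestion, assuming the legacy `Field`/`LanguageModelingDataset` API and that `/Users/dev/billion_google` (the path used later in this thread) is a plain-text file; the `Field` settings are placeholders:

```python
# Wrap a local plain-text corpus file in a LanguageModelingDataset.
# Assumes '/Users/dev/billion_google' is a text file, one sentence per line.
from torchtext.data import Field
from torchtext.datasets import LanguageModelingDataset

TEXT = Field(lower=True, tokenize=lambda s: s.split())

billion_google = LanguageModelingDataset(
    path='/Users/dev/billion_google',
    text_field=TEXT,
    encoding='utf-8')
```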
Hello, this is how I load WikiText2:

```python
train_Wiki2, val_Wiki2, test_Wiki2 = torchtext.datasets.WikiText2.splits(TEXT)
```

When I execute:

```python
train_billion_google, val_billion_google, test_billion_google = LanguageModelingDataset.splits(
    path='/Users/dev/billion_google',
    encoding='utf-8',
    text_field=TEXT)
```

I get the following error: `ValueError: not enough values to unpack (expected 3, got 0)`. Thank you again,
Thanks @bentrevett. In your case, it should be pretty simple to use our new pattern (take a look here). If you would like to add this dataset, feel free to open a PR.
Hello, my background is not computer programming and I am having a hard time understanding the posts. I tried the code below and it's giving me an error...

```python
train_billion_google, val_billion_google, test_billion_google = LanguageModelingDataset.splits(
    path='/Users/dev/billion_google',
    encoding='utf-8',
    text_field=TEXT)
```

```
ValueError: not enough values to unpack (expected 3, got 0)
```
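For context on that traceback: in the legacy API, `Dataset.splits()` only builds the splits whose filenames are passed in, and `LanguageModelingDataset` defines no default `train`/`validation`/`test` names, so a call with only `path` returns an empty tuple, hence "expected 3, got 0". A sketch of a call that would unpack, assuming `path` is a directory that really contains the three (hypothetical) files named below:

```python
# Sketch only: the directory and file names are assumptions; each file
# must exist and contain text. splits() joins path with each filename.
train_lm, valid_lm, test_lm = LanguageModelingDataset.splits(
    path='/Users/dev/billion_google_splits',   # a directory, not a single file
    train='train.txt',
    validation='valid.txt',
    test='test.txt',
    text_field=TEXT,
    encoding='utf-8')
```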
@h56cho You could download the file, which is very large, and pass the path to LanguageModelingDataset.
Hello, I tried a slightly different approach and it's still not working for me...

```python
# loading Google 1 Billion Benchmark dataset
billion_google = LanguageModelingDataset(
    path='/Users/dev/billion_google',
    encoding='utf-8',
    text_field=TEXT_billion_google)

# note: I created empty files train_billion_google, val_billion_google, and test_billion_google
# under the appropriate directory before executing this code.
train_billion_google, val_billion_google, test_billion_google = billion_google.splits(
    path='/Users/dev/',
    train='train_billion_google',
    validation='val_billion_google',
    test='test_billion_google',
    text_field=TEXT_billion_google)
```

Python indicates that the call still fails.
Can you check the training file 'train_billion_google'?
Hello, the file 'train_billion_google' is empty.
So you didn't download it correctly.
Hello, I thought that I am supposed to manually make the empty files named train_billion_google, val_billion_google, and test_billion_google myself. This is the code I ran:

```python
# loading Google 1 Billion Benchmark dataset
billion_google = LanguageModelingDataset(
    path='/Users/dev/billion_google',
    encoding='utf-8',
    text_field=TEXT_billion_google)

# note: I created empty files train_billion_google, val_billion_google, and test_billion_google
# under the appropriate directory before executing this code.
train_billion_google, val_billion_google, test_billion_google = billion_google.splits(
    path='/Users/dev/',
    train='train_billion_google',
    validation='val_billion_google',
    test='test_billion_google',
    text_field=TEXT_billion_google)
```

My "billion_google" file is full of text, so I don't think I downloaded it the wrong way (it was a successful download). The files train_billion_google, val_billion_google, and test_billion_google are the empty ones I created. Again, how can I make the train/validation/test splits?

Thank you,
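One way to end up with non-empty split files before calling `splits()` is to carve the downloaded corpus into three plain-text files yourself. A rough sketch; the 98/1/1 line assignment, the output names, and the paths are arbitrary assumptions, not anything prescribed by torchtext or this thread:

```python
# Stream the downloaded corpus once and write each line into one of
# three split files (roughly 98% train, 1% validation, 1% test).
outputs = {
    'train': open('/Users/dev/train_billion_google', 'w', encoding='utf-8'),
    'val':   open('/Users/dev/val_billion_google', 'w', encoding='utf-8'),
    'test':  open('/Users/dev/test_billion_google', 'w', encoding='utf-8'),
}
with open('/Users/dev/billion_google', encoding='utf-8') as src:
    for i, line in enumerate(src):
        if i % 100 == 98:
            outputs['val'].write(line)
        elif i % 100 == 99:
            outputs['test'].write(line)
        else:
            outputs['train'].write(line)
for f in outputs.values():
    f.close()
```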
I don't get how I can use the "new pattern" in TorchText to make these splits.
It confused me a bit. To test your code, you should add "something" to the train_billion_google file (it cannot be empty). For the new pattern, it's not supposed to use Field at all.
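For readers who land here, a rough, library-free sketch of the idea behind that "new pattern" (iterate over the raw text yourself, build a vocabulary, and hand PyTorch a flat tensor of token ids, with no `Field` involved). Every path and name below is made up for illustration:

```python
# Illustrative only: tokenize a text file, build a vocabulary, and turn
# the corpus into one long tensor of token ids for a language model.
from collections import Counter
import torch

def tokens(path):
    with open(path, encoding='utf-8') as f:
        for line in f:
            yield from line.split()

corpus = '/Users/dev/train_billion_google'
counter = Counter(tokens(corpus))
itos = ['<unk>'] + sorted(counter)              # index-to-string
stoi = {tok: i for i, tok in enumerate(itos)}   # string-to-index

ids = torch.tensor([stoi.get(t, 0) for t in tokens(corpus)], dtype=torch.long)
print(ids.shape)  # a single flat sequence of token ids
```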
Thank you! Your answer solved my issue. |
Hello,
I am new to NLP and I have some questions.
I downloaded the Google 1-billion benchmark dataset, and I am trying to use the dataset with PyTorch.
I also want to make use of WikiText2, which is built into PyTorch:

```python
train_Wiki2, val_Wiki2, test_Wiki2 = torchtext.datasets.WikiText2.splits(TEXT)
```

...and I want my `train_billion_google` to have the same structure as my `train_Wiki2`; more specifically, I want my `train_billion_google` to store a list of individual tokens under `train_billion_google.examples[0].text`. How can I do this?
Thank you,
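For reference, a hedged sketch that ties the thread's answer back to this question: load the downloaded file with `LanguageModelingDataset`, and it ends up with the same shape as the WikiText2 splits, i.e. one `Example` whose `.text` is the flat list of tokens. The path and `Field` settings are placeholders, and the legacy (pre-0.9) torchtext API is assumed:

```python
import torchtext
from torchtext.data import Field, BPTTIterator
from torchtext.datasets import LanguageModelingDataset

TEXT = Field(lower=True, tokenize=lambda s: s.split())

# Built-in WikiText2 splits, as in the question above.
train_Wiki2, val_Wiki2, test_Wiki2 = torchtext.datasets.WikiText2.splits(TEXT)

# The downloaded Google 1-billion file, loaded the same way.
train_billion_google = LanguageModelingDataset(
    path='/Users/dev/billion_google', text_field=TEXT, encoding='utf-8')

# Both store the whole corpus as a single Example with a flat token list.
print(train_Wiki2.examples[0].text[:10])
print(train_billion_google.examples[0].text[:10])

# From here the usual language-modeling pipeline applies.
TEXT.build_vocab(train_billion_google)
train_iter = BPTTIterator(train_billion_google, batch_size=32, bptt_len=30)
```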