
Using Google 1-billion benchmark data on PyTorch #644

Closed
h56cho opened this issue Nov 19, 2019 · 14 comments

h56cho commented Nov 19, 2019

Hello,

I am new to NLP and I have some questions.

I downloaded the Google 1-billion benchmark dataset, and I am trying to use it with PyTorch:

               
# Import packages 
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.model_selection import train_test_split
from torchtext.data import Field, BucketIterator, TabularDataset
from transformers import OpenAIGPTConfig, OpenAIGPTTokenizer, OpenAIGPTLMHeadModel
from transformers import AdamW, WarmupLinearSchedule
from scipy.spatial import distance
import spacy
import torchtext
from torchtext.data.utils import get_tokenizer
from torchtext.data import Field, BPTTIterator
import tensorflow as tf
#import lineflow as lf
#import lineflow.datasets as lfds
import math
import random
import numpy as np
import pandas as pd 
import time

# set hyperparameters for this experiment
bptt = 30
batch_size = 64
lr = 0.01 # learning rate
#criterion = nn.CrossEntropyLoss() # loss criterion
log_interval = 200
nlayer = 6

# define tokenizer
en = spacy.load('en')

def Sp_Tokenizer(text): 
    return [tok.text for tok in en.tokenizer(text)]

# define the English text field
TEXT = Field(tokenize = Sp_Tokenizer,
             init_token = '<sos>',
             eos_token = '<eos>',
             unk_token = '<unk>',
             pad_token = '<pad>',
             tokenizer_language = 'en',
             lower = True)

# loading Google 1 Billion Benchmark dataset
billion_google = open('/Users/dev/billion_google', encoding='utf-8').read()
# iterate over lines (one sentence per line), not over individual characters
billion_google_dict = {'English': [line for line in billion_google.splitlines()]}
# convert billion_google into a pandas dataframe
billion_google_df = pd.DataFrame(billion_google_dict, columns=["English"])

# remove very long sentences
billion_google_df['eng_len'] = billion_google_df['English'].str.count(' ')
billion_google_df = billion_google_df.query('eng_len < 1025')

# create train and test set 
train_billion_google, test_billion_google = train_test_split(billion_google_df, test_size=0.2)
train_billion_google.to_csv("train_billion_google.csv", index=False)
test_billion_google.to_csv("test_billion_google.csv", index=False)

data_fields = [('English', TEXT)]
train_billion_google, test_billion_google = TabularDataset.splits(path='./', 
                                                                  train='train_billion_google.csv',
                                                                  validation='test_billion_google.csv', 
                                                                  format='csv', 
                                                                  fields=data_fields)

I also want to make use of the WikiText2 dataset that is built into torchtext:

train_Wiki2, val_Wiki2, test_Wiki2 = torchtext.datasets.WikiText2.splits(TEXT)

...and I want my train_billion_google to have the same structure as my train_Wiki2; more specifically, I want train_billion_google to store a list of individual tokens under train_billion_google.examples[0].text.

How can I do this?

Thank you,

h56cho closed this as completed Nov 19, 2019
bentrevett (Contributor) commented:

You should use the LanguageModelingDataset instead of the TabularDataset.
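
A minimal sketch of this suggestion, assuming the legacy torchtext API already used in the question and that /Users/dev/billion_google is a plain-text file (the variable names are only illustrative):

# LanguageModelingDataset reads the whole file into one long stream of tokens,
# which is the same structure WikiText2 exposes via examples[0].text.
from torchtext.data import Field, BPTTIterator
from torchtext.datasets import LanguageModelingDataset

TEXT = Field(lower=True, tokenize=Sp_Tokenizer)  # Sp_Tokenizer as defined in the question

billion_google = LanguageModelingDataset(path='/Users/dev/billion_google',
                                         text_field=TEXT,
                                         encoding='utf-8')

TEXT.build_vocab(billion_google)
train_iter = BPTTIterator(billion_google, batch_size=64, bptt_len=30)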

h56cho reopened this Nov 19, 2019

h56cho commented Nov 19, 2019

Hello,
Thank you for your reply.
I am new to torchtext and there are many things I am unfamiliar with.
How can I use LanguageModelingDataset with splits?
I want to split my billion_google dataset into train_billion_google, val_billion_google, and test_billion_google, just as in the case of the WikiText2 dataset when I execute the line of code below:

train_Wiki2, val_Wiki2, test_Wiki2 = torchtext.datasets.WikiText2.splits(TEXT)

When I execute:

train_billion_google, val_billion_google, test_billion_google = \
    LanguageModelingDataset.splits(path='/Users/dev/billion_google',
                                   encoding='utf-8',
                                   text_field=TEXT)

I get the following error:

ValueError: not enough values to unpack (expected 3, got 0)

Thank you again,
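
For context, the splits() helper in torchtext only loads files that you name explicitly via its train, validation, and test arguments; with none given it returns an empty tuple, which would explain the "expected 3, got 0" error above. A minimal sketch of the call it expects (the file names are placeholders and the files must already exist):

from torchtext.datasets import LanguageModelingDataset

# each argument names a separate, pre-split text file under path
train_bg, val_bg, test_bg = LanguageModelingDataset.splits(
    path='/Users/dev/',
    train='train_billion_google',
    validation='val_billion_google',
    test='test_billion_google',
    text_field=TEXT)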

zhangguanheng66 (Contributor) commented Nov 19, 2019

Thanks @bentrevett
We are re-writing several datasets, including the wikitext2 dataset, @h56cho.

In your case, it should be pretty simple to use our new pattern (take a look here). If you would like to add this billion_google dataset to torchtext, please open a PR and I'm happy to review/land it. Regarding the splits function, the new pattern is more compatible with torch.utils.data, and you should be able to use those functions there. Here is our example.
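
A minimal sketch of the kind of splitting torch.utils.data supports once a dataset follows its map-style protocol; BillionGoogleDataset here is a hypothetical wrapper, not an existing torchtext class:

from torch.utils.data import Dataset, random_split

class BillionGoogleDataset(Dataset):
    # hypothetical map-style dataset: one tokenized sentence per example
    def __init__(self, path, tokenize):
        with open(path, encoding='utf-8') as f:
            self.examples = [tokenize(line.strip()) for line in f if line.strip()]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]

dataset = BillionGoogleDataset('/Users/dev/billion_google', str.split)

# random_split takes the place of the old splits() call: 80/10/10 train/val/test
n_total = len(dataset)
n_train = int(0.8 * n_total)
n_val = int(0.1 * n_total)
train_set, val_set, test_set = random_split(
    dataset, [n_train, n_val, n_total - n_train - n_val])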


h56cho commented Nov 19, 2019

Hello,

My background is not in computer programming and I am having a hard time understanding the posts.
If you can, could you please provide me with some code showing how I can apply the splits function to billion_google?

I tried the code below and it's giving me an error...

train_billion_google, val_billion_google, test_billion_google = \
    LanguageModelingDataset.splits(path='/Users/dev/billion_google',
                                   encoding='utf-8',
                                   text_field=TEXT)
ValueError: not enough values to unpack (expected 3, got 0)

zhangguanheng66 (Contributor) commented:

@h56cho You could download the file, which is very large, and pass the path to LanguageModelingDataset.


h56cho commented Nov 19, 2019

Hello,
Yes, I know that, but my question is that I don't know how to split the billion_google data into train, test, and validation sets. The splits function is not working for me; Python complains that there are not enough values to unpack.


h56cho commented Nov 19, 2019

Hello,

I tried a slightly different approach and it's still not working for me...

# loading Google 1 Billion Benchmark dataset
billion_google = LanguageModelingDataset(path = '/Users/dev/billion_google', 
                                         encoding = 'utf-8', 
                                         text_field = TEXT_billion_google)

# note: I created empty files train_billion_google, val_billion_google, and test_billion_google
# under the appropriate directory before executing this code.
train_billion_google, val_billion_google, test_billion_google = billion_google.splits(path = '/Users/dev/',
                                                                                      train = 'train_billion_google',
                                                                                      validation = 'val_billion_google',
                                                                                      test = 'test_billion_google', 
                                                                                      text_field = TEXT_billion_google)

Python indicates that the train_billion_google set in this case contains only 13 tokens, which is clearly wrong. What am I doing wrong here? I can't split billion_google!

zhangguanheng66 (Contributor) commented:

Can you check the training file 'train_billion_google'?


h56cho commented Nov 19, 2019

Hello,

The file train_billion_google is only 196 bytes, and when I open it, it's completely empty.

zhangguanheng66 (Contributor) commented:

So you didn't download it correctly.


h56cho commented Nov 19, 2019

Hello,

I thought that I was supposed to manually create the empty files named train_billion_google, val_billion_google, and test_billion_google before executing the lines of code below:

# loading Google 1 Billion Benchmark dataset
billion_google = LanguageModelingDataset(path = '/Users/dev/billion_google', 
                                         encoding = 'utf-8', 
                                         text_field = TEXT_billion_google)

# note: I created empty files train_billion_google, val_billion_google, and test_billion_google
# under the appropriate directory before executing this code.
train_billion_google, val_billion_google, test_billion_google = billion_google.splits(path = '/Users/dev/',
                                                                                      train = 'train_billion_google',
                                                                                      validation = 'val_billion_google',
                                                                                      test = 'test_billion_google', 
                                                                                      text_field = TEXT_billion_google)

my "billion_google" file is full of texts, so I don't think I downloaded it in a wrong way (it was a successful download). The files train_billion_google, val_billion_google, test_billion_google are empty because I artificially made those 3 empty files and placed under the folder /User/dev/, because I thought I was supposed to do that to run the lines of code that I typed above.

again, how can I make the splits function to work?

Thank you,


h56cho commented Nov 19, 2019

I don't get how I can use the "new pattern" in torchtext to make the splits function work.


zhangguanheng66 commented Nov 19, 2019

This is quite confusing. To test your code, you should add some text to the train_billion_google file and see if it loads the information as expected.

For the new pattern, you are not supposed to use splits. If you take a look at the PR I attached above, it should be pretty clear how to put together a new dataset (i.e. billion_google).
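
As the comments above suggest, splits() only loads files that already contain text; it never divides a single file for you. A minimal sketch, again assuming the legacy torchtext API from the earlier snippets, of writing three non-empty files first and then loading them:

import random
from torchtext.datasets import LanguageModelingDataset

# divide the raw file into three non-empty files before calling splits()
with open('/Users/dev/billion_google', encoding='utf-8') as f:
    lines = [line for line in f if line.strip()]
random.shuffle(lines)

n_train = int(0.8 * len(lines))
n_val = int(0.1 * len(lines))
parts = {
    'train_billion_google': lines[:n_train],
    'val_billion_google': lines[n_train:n_train + n_val],
    'test_billion_google': lines[n_train + n_val:],
}
for name, chunk in parts.items():
    with open('/Users/dev/' + name, 'w', encoding='utf-8') as out:
        out.writelines(chunk)

# splits() now loads real data from each of the three files
train_bg, val_bg, test_bg = LanguageModelingDataset.splits(
    path='/Users/dev/',
    train='train_billion_google',
    validation='val_billion_google',
    test='test_billion_google',
    text_field=TEXT_billion_google)  # the Field used in the earlier snippets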


h56cho commented Nov 20, 2019

Thank you!

Your answer solved my issue.
