
Using Google 1-billion benchmark data on PyTorch #644

Closed
h56cho opened this issue Nov 19, 2019 · 14 comments

h56cho commented Nov 19, 2019

Hello,

I am new to NLP and I have some questions.

I downloaded the Google 1-billion benchmark dataset, and I am trying to use it with PyTorch:

               
# Import packages 
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.model_selection import train_test_split
from torchtext.data import Field, BucketIterator, TabularDataset
from transformers import OpenAIGPTConfig, OpenAIGPTTokenizer, OpenAIGPTLMHeadModel
from transformers import AdamW, WarmupLinearSchedule
from scipy.spatial import distance
import spacy
import torchtext
from torchtext.data.utils import get_tokenizer
from torchtext.data import Field, BPTTIterator
import tensorflow as tf
#import lineflow as lf
#import lineflow.datasets as lfds
import math
import random
import numpy as np
import pandas as pd 
import time

# set hyperparameters for this experiment
bptt = 30
batch_size = 64
lr = 0.01 # learning rate
#criterion = nn.CrossEntropyLoss() # loss criterion
log_interval = 200
nlayer = 6

# define tokenizer
en = spacy.load('en')

def Sp_Tokenizer(text): 
    return [tok.text for tok in en.tokenizer(text)]

# define the English text field
TEXT = Field(tokenize = Sp_Tokenizer,
             init_token = '<sos>',
             eos_token = '<eos>',
             unk_token = '<unk>',
             pad_token = '<pad>',
             tokenizer_language = 'en',
             lower = True)

# loading Google 1 Billion Benchmark dataset
billion_google = open('/Users/dev/billion_google', encoding='utf-8').read()
# iterate over lines (one sentence per line), not over individual characters
billion_google_dict = {'English': [line for line in billion_google.splitlines()]}
# convert billion_google into a pandas dataframe
billion_google_df = pd.DataFrame(billion_google_dict, columns=["English"])

# remove very long sentences
billion_google_df['eng_len'] = billion_google_df['English'].str.count(' ')
billion_google_df = billion_google_df.query('eng_len < 1025')

# create train and test set 
train_billion_google, test_billion_google = train_test_split(billion_google_df, test_size=0.2)
train_billion_google.to_csv("train_billion_google.csv", index=False)
test_billion_google.to_csv("test_billion_google.csv", index=False)

data_fields = [('English', TEXT)]
train_billion_google, test_billion_google = TabularDataset.splits(path='./', 
                                                                  train='train_billion_google.csv',
                                                                  validation='test_billion_google.csv', 
                                                                  format='csv', 
                                                                  fields=data_fields)

I also want to make use of the WikiText2 dataset that is built into torchtext:

train_Wiki2, val_Wiki2, test_Wiki2 = torchtext.datasets.WikiText2.splits(TEXT)

...and I want my train_billion_google to have the same structure as my train_Wiki2; more specifically, I want train_billion_google to store a list of individual tokens under train_billion_google.examples[0].text.

How can I do this?

Thank you,

h56cho closed this as completed Nov 19, 2019
bentrevett (Contributor) commented:

You should use the LanguageModelingDataset instead of the TabularDataset.
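
A minimal sketch of this suggestion, assuming the legacy torchtext API already used in the question and that /Users/dev/billion_google is a plain-text file (the variable names are only illustrative):

# LanguageModelingDataset reads the whole file into one long stream of tokens,
# which is the same structure WikiText2 exposes via examples[0].text.
from torchtext.data import Field, BPTTIterator
from torchtext.datasets import LanguageModelingDataset

TEXT = Field(lower=True, tokenize=Sp_Tokenizer)  # Sp_Tokenizer as defined in the question

billion_google = LanguageModelingDataset(path='/Users/dev/billion_google',
                                         text_field=TEXT,
                                         encoding='utf-8')

TEXT.build_vocab(billion_google)
train_iter = BPTTIterator(billion_google, batch_size=64, bptt_len=30)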

h56cho reopened this Nov 19, 2019

h56cho commented Nov 19, 2019

Hello,
Thank you for your reply.
I am new to torchtext and there are many things I am unfamiliar with.
How can I use LanguageModelingDataset with splits?
I want to split my billion_google dataset into train_billion_google, val_billion_google, and test_billion_google, just as in the case of the WikiText2 dataset when I execute the line of code below:

train_Wiki2, val_Wiki2, test_Wiki2 = torchtext.datasets.WikiText2.splits(TEXT)

When I execute:

train_billion_google, val_billion_google, test_billion_google = \
    LanguageModelingDataset.splits(path='/Users/dev/billion_google',
                                   encoding='utf-8',
                                   text_field=TEXT)

I get the following error:

ValueError: not enough values to unpack (expected 3, got 0)

Thank you again,
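
For context, the splits() helper in torchtext only loads files that you name explicitly via its train, validation, and test arguments; with none given it returns an empty tuple, which would explain the "expected 3, got 0" error above. A minimal sketch of the call it expects (the file names are placeholders and the files must already exist):

from torchtext.datasets import LanguageModelingDataset

# each argument names a separate, pre-split text file under path
train_bg, val_bg, test_bg = LanguageModelingDataset.splits(
    path='/Users/dev/',
    train='train_billion_google',
    validation='val_billion_google',
    test='test_billion_google',
    text_field=TEXT)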

zhangguanheng66 (Contributor) commented Nov 19, 2019

Thanks @bentrevett
We are re-writing several datasets, including the wikitext2 dataset, @h56cho.

In your case, it should be pretty simple to use our new pattern (take a look here). If you would like to add this billion_google dataset to torchtext, please open a PR and I'm happy to review/land it. Regarding the splits function, the new pattern is more compatible with torch.utils.data, and you should be able to use those functions there. Here is our example.
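
A minimal sketch of the kind of splitting torch.utils.data supports once a dataset follows its map-style protocol; BillionGoogleDataset here is a hypothetical wrapper, not an existing torchtext class:

from torch.utils.data import Dataset, random_split

class BillionGoogleDataset(Dataset):
    # hypothetical map-style dataset: one tokenized sentence per example
    def __init__(self, path, tokenize):
        with open(path, encoding='utf-8') as f:
            self.examples = [tokenize(line.strip()) for line in f if line.strip()]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]

dataset = BillionGoogleDataset('/Users/dev/billion_google', str.split)

# random_split takes the place of the old splits() call: 80/10/10 train/val/test
n_total = len(dataset)
n_train = int(0.8 * n_total)
n_val = int(0.1 * n_total)
train_set, val_set, test_set = random_split(
    dataset, [n_train, n_val, n_total - n_train - n_val])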


h56cho commented Nov 19, 2019

Hello,

My background is not in computer programming and I am having a hard time understanding the posts.
If you can, could you please provide me with some code showing how I can apply the splits function to billion_google?

I tried the code below and it's giving me an error...

train_billion_google, val_billion_google, test_billion_google = \
    LanguageModelingDataset.splits(path='/Users/dev/billion_google',
                                   encoding='utf-8',
                                   text_field=TEXT)
ValueError: not enough values to unpack (expected 3, got 0)

zhangguanheng66 (Contributor) commented:

@h56cho You could download the file, which is very large, and pass the path to LanguageModelingDataset.


h56cho commented Nov 19, 2019

Hello,
Yes, I know that, but my question is that I don't know how to split the billion_google data into train, test, and validation sets. The splits function is not working for me; Python complains that there are not enough values to unpack.


h56cho commented Nov 19, 2019

Hello,

I tried a slightly different approach and it's still not working for me...

# loading Google 1 Billion Benchmark dataset
billion_google = LanguageModelingDataset(path = '/Users/dev/billion_google', 
                                         encoding = 'utf-8', 
                                         text_field = TEXT_billion_google)

# note: I created empty files train_billion_google, val_billion_google, and test_billion_google
# under the appropriate directory before executing this code.
train_billion_google, val_billion_google, test_billion_google = billion_google.splits(path = '/Users/dev/',
                                                                                      train = 'train_billion_google',
                                                                                      validation = 'val_billion_google',
                                                                                      test = 'test_billion_google', 
                                                                                      text_field = TEXT_billion_google)

Python indicates that the train_billion_google set in this case contains only 13 tokens, which is clearly wrong. What am I doing wrong here? I can't split billion_google!

zhangguanheng66 (Contributor) commented:

Can you check the training file 'train_billion_google'?


h56cho commented Nov 19, 2019

Hello,

The file train_billion_google is only 196 bytes, and when I open it, it's completely empty.

zhangguanheng66 (Contributor) commented:

So you didn't download it correctly.


h56cho commented Nov 19, 2019

Hello,

I thought that I was supposed to manually create the empty files named train_billion_google, val_billion_google, and test_billion_google before executing the lines of code below:

# loading Google 1 Billion Benchmark dataset
billion_google = LanguageModelingDataset(path = '/Users/dev/billion_google', 
                                         encoding = 'utf-8', 
                                         text_field = TEXT_billion_google)

# note: I created empty files train_billion_google, val_billion_google, and test_billion_google
# under the appropriate directory before executing this code.
train_billion_google, val_billion_google, test_billion_google = billion_google.splits(path = '/Users/dev/',
                                                                                      train = 'train_billion_google',
                                                                                      validation = 'val_billion_google',
                                                                                      test = 'test_billion_google', 
                                                                                      text_field = TEXT_billion_google)

my "billion_google" file is full of texts, so I don't think I downloaded it in a wrong way (it was a successful download). The files train_billion_google, val_billion_google, test_billion_google are empty because I artificially made those 3 empty files and placed under the folder /User/dev/, because I thought I was supposed to do that to run the lines of code that I typed above.

again, how can I make the splits function to work?

Thank you,


h56cho commented Nov 19, 2019

I don't get how I can use the "new pattern" in torchtext to make the splits function work.


zhangguanheng66 commented Nov 19, 2019

This is quite confusing. To test your code, you should add some text to the train_billion_google file and see if it loads the information as expected.

For the new pattern, you are not supposed to use splits. If you take a look at the PR I attached above, it should be pretty clear how to put together a new dataset (i.e. billion_google).
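
As the comments above suggest, splits() only loads files that already contain text; it never divides a single file for you. A minimal sketch, again assuming the legacy torchtext API from the earlier snippets, of writing three non-empty files first and then loading them:

import random
from torchtext.datasets import LanguageModelingDataset

# divide the raw file into three non-empty files before calling splits()
with open('/Users/dev/billion_google', encoding='utf-8') as f:
    lines = [line for line in f if line.strip()]
random.shuffle(lines)

n_train = int(0.8 * len(lines))
n_val = int(0.1 * len(lines))
parts = {
    'train_billion_google': lines[:n_train],
    'val_billion_google': lines[n_train:n_train + n_val],
    'test_billion_google': lines[n_train + n_val:],
}
for name, chunk in parts.items():
    with open('/Users/dev/' + name, 'w', encoding='utf-8') as out:
        out.writelines(chunk)

# splits() now loads real data from each of the three files
train_bg, val_bg, test_bg = LanguageModelingDataset.splits(
    path='/Users/dev/',
    train='train_billion_google',
    validation='val_billion_google',
    test='test_billion_google',
    text_field=TEXT_billion_google)  # the Field used in the earlier snippets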


h56cho commented Nov 20, 2019

Thank you!

Your answer solved my issue.
